DataLakeFileClient Class
A client to interact with the DataLake file, even if the file may not yet exist.
- Inheritance
-
azure.storage.filedatalake._path_client.PathClientDataLakeFileClient
Constructor
DataLakeFileClient(account_url: str, file_system_name: str, file_path: str, credential: str | Dict[str, str] | AzureNamedKeyCredential | AzureSasCredential | TokenCredential | None = None, **kwargs: Any)
Parameters
Name | Description |
---|---|
account_url
Required
|
The URI to the storage account. |
file_system_name
Required
|
The file system for the directory or files. |
file_path
Required
|
The whole file path, so that to interact with a specific file. eg. "{directory}/{subdirectory}/{file}" |
credential
|
The credentials with which to authenticate. This is optional if the account URL already has a SAS token. The value can be a SAS token string, an instance of a AzureSasCredential or AzureNamedKeyCredential from azure.core.credentials, an account shared access key, or an instance of a TokenCredentials class from azure.identity. If the resource URI already contains a SAS token, this will be ignored in favor of an explicit credential
Default value: None
|
Keyword-Only Parameters
Name | Description |
---|---|
api_version
|
The Storage API version to use for requests. Default value is the most recent service version that is compatible with the current SDK. Setting to an older version may result in reduced feature compatibility. |
audience
|
The audience to use when requesting tokens for Azure Active Directory authentication. Only has an effect when credential is of type TokenCredential. The value could be https://storage.azure.com/ (default) or https://.blob.core.windows.net. |
Examples
Creating the DataLakeServiceClient from connection string.
from azure.storage.filedatalake import DataLakeFileClient
DataLakeFileClient.from_connection_string(connection_string, "myfilesystem", "mydirectory", "myfile")
Variables
Name | Description |
---|---|
url
|
The full endpoint URL to the file system, including SAS token if used. |
primary_endpoint
|
The full primary endpoint URL. |
primary_hostname
|
The hostname of the primary endpoint. |
Methods
acquire_lease |
Requests a new lease. If the file or directory does not have an active lease, the DataLake service creates a lease on the file/directory and returns a new lease ID. |
append_data |
Append data to the file. |
close |
This method is to close the sockets opened by the client. It need not be used when using with a context manager. |
create_file |
Create a new file. |
delete_file |
Marks the specified file for deletion. |
download_file |
Downloads a file to the StorageStreamDownloader. The readall() method must be used to read all the content, or readinto() must be used to download the file into a stream. Using chunks() returns an iterator which allows the user to iterate over the content in chunks. |
exists |
Returns True if a file exists and returns False otherwise. |
flush_data |
Commit the previous appended data. |
from_connection_string |
Create DataLakeFileClient from a Connection String. |
get_access_control | |
get_file_properties |
Returns all user-defined metadata, standard HTTP properties, and system properties for the file. It does not return the content of the file. |
query_file |
Enables users to select/project on datalake file data by providing simple query expressions. This operations returns a DataLakeFileQueryReader, users need to use readall() or readinto() to get query data. |
remove_access_control_recursive |
Removes the Access Control on a path and sub-paths. |
rename_file |
Rename the source file. |
set_access_control |
Set the owner, group, permissions, or access control list for a path. |
set_access_control_recursive |
Sets the Access Control on a path and sub-paths. |
set_file_expiry |
Sets the time a file will expire and be deleted. |
set_http_headers |
Sets system properties on the file or directory. If one property is set for the content_settings, all properties will be overridden. |
set_metadata |
Sets one or more user-defined name-value pairs for the specified file system. Each call to this operation replaces all existing metadata attached to the file system. To remove all metadata from the file system, call this operation with no metadata dict. |
update_access_control_recursive |
Modifies the Access Control on a path and sub-paths. |
upload_data |
Upload data to a file. |
acquire_lease
Requests a new lease. If the file or directory does not have an active lease, the DataLake service creates a lease on the file/directory and returns a new lease ID.
acquire_lease(lease_duration: int | None = -1, lease_id: str | None = None, **kwargs) -> DataLakeLeaseClient
Parameters
Name | Description |
---|---|
lease_duration
Required
|
Specifies the duration of the lease, in seconds, or negative one (-1) for a lease that never expires. A non-infinite lease can be between 15 and 60 seconds. A lease duration cannot be changed using renew or change. Default is -1 (infinite lease). |
lease_id
Required
|
Proposed lease ID, in a GUID string format. The DataLake service returns 400 (Invalid request) if the proposed lease ID is not in the correct format. |
Keyword-Only Parameters
Name | Description |
---|---|
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
A DataLakeLeaseClient object, that can be run in a context manager. |
append_data
Append data to the file.
append_data(data: bytes | str | Iterable[AnyStr] | IO[AnyStr], offset: int, length: int | None = None, **kwargs) -> Dict[str, str | datetime | int]
Parameters
Name | Description |
---|---|
data
Required
|
Content to be appended to file |
offset
Required
|
start position of the data to be appended to. |
length
Required
|
Size of the data in bytes. |
Keyword-Only Parameters
Name | Description |
---|---|
flush
|
If true, will commit the data after it is appended. |
validate_content
|
If true, calculates an MD5 hash of the block content. The storage service checks the hash of the content that has arrived with the hash that was sent. This is primarily valuable for detecting bitflips on the wire if using http instead of https as https (the default) will already validate. Note that this MD5 hash is not stored with the file. |
lease_action
|
Literal["acquire", "auto-renew", "release", "acquire-release"]
Used to perform lease operations along with appending data. "acquire" - Acquire a lease. "auto-renew" - Re-new an existing lease. "release" - Release the lease once the operation is complete. Requires flush=True. "acquire-release" - Acquire a lease and release it once the operations is complete. Requires flush=True. |
lease_duration
|
Valid if lease_action is set to "acquire" or "acquire-release". Specifies the duration of the lease, in seconds, or negative one (-1) for a lease that never expires. A non-infinite lease can be between 15 and 60 seconds. A lease duration cannot be changed using renew or change. Default is -1 (infinite lease). |
lease
|
Required if the file has an active lease or if lease_action is set to "acquire" or "acquire-release". If the file has an existing lease, this will be used to access the file. If acquiring a new lease, this will be used as the new lease id. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
cpk
|
Encrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. |
Returns
Type | Description |
---|---|
dict of the response header. |
Examples
Append data to the file.
file_client.append_data(data=file_content[2048:3072], offset=2048, length=1024)
close
This method is to close the sockets opened by the client. It need not be used when using with a context manager.
close() -> None
Keyword-Only Parameters
Name | Description |
---|---|
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
create_file
Create a new file.
create_file(content_settings: ContentSettings | None = None, metadata: Dict[str, str] | None = None, **kwargs) -> Dict[str, str | datetime]
Parameters
Name | Description |
---|---|
content_settings
Required
|
ContentSettings object used to set path properties. |
metadata
Required
|
Name-value pairs associated with the file as metadata. |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
umask
|
Optional and only valid if Hierarchical Namespace is enabled for the account. When creating a file or directory and the parent folder does not have a default ACL, the umask restricts the permissions of the file or directory to be created. The resulting permission is given by p & ^u, where p is the permission and u is the umask. For example, if p is 0777 and u is 0057, then the resulting permission is 0720. The default permission is 0777 for a directory and 0666 for a file. The default umask is 0027. The umask must be specified in 4-digit octal notation (e.g. 0766). |
owner
|
The owner of the file or directory. |
group
|
The owning group of the file or directory. |
acl
|
Sets POSIX access control rights on files and directories. The value is a comma-separated list of access control entries. Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format "[scope:][type]:[id]:[permissions]". |
lease_id
|
Proposed lease ID, in a GUID string format. The DataLake service returns 400 (Invalid request) if the proposed lease ID is not in the correct format. |
lease_duration
|
Specifies the duration of the lease, in seconds, or negative one (-1) for a lease that never expires. A non-infinite lease can be between 15 and 60 seconds. A lease duration cannot be changed using renew or change. |
expires_on
|
The time to set the file to expiry. If the type of expires_on is an int, expiration time will be set as the number of milliseconds elapsed from creation time. If the type of expires_on is datetime, expiration time will be set absolute to the time provided. If no time zone info is provided, this will be interpreted as UTC. |
permissions
|
Optional and only valid if Hierarchical Namespace is enabled for the account. Sets POSIX access permissions for the file owner, the file owning group, and others. Each class may be granted read, write, or execute permission. The sticky bit is also supported. Both symbolic (rwxrw-rw-) and 4-digit octal notation (e.g. 0766) are supported. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Encrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
encryption_context
|
Specifies the encryption context to set on the file. |
Returns
Type | Description |
---|---|
response dict (Etag and last modified). |
Examples
Create file.
file_client = filesystem_client.get_file_client(file_name)
file_client.create_file()
delete_file
Marks the specified file for deletion.
delete_file(**kwargs) -> None
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
Required if the file has an active lease. Value can be a LeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
None. |
Examples
Delete file.
new_client.delete_file()
download_file
Downloads a file to the StorageStreamDownloader. The readall() method must be used to read all the content, or readinto() must be used to download the file into a stream. Using chunks() returns an iterator which allows the user to iterate over the content in chunks.
download_file(offset: int | None = None, length: int | None = None, **kwargs: Any) -> StorageStreamDownloader
Parameters
Name | Description |
---|---|
offset
Required
|
Start of byte range to use for downloading a section of the file. Must be set if length is provided. |
length
Required
|
Number of bytes to read from the stream. This is optional, but should be supplied for optimal performance. |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
If specified, download only succeeds if the file's lease is active and matches this ID. Required if the file has an active lease. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Decrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. Required if the file was created with a Customer-Provided Key. |
max_concurrency
|
Maximum number of parallel connections to use when transferring the file in chunks. This option does not affect the underlying connection pool, and may require a separate configuration of the connection pool. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. This method may make multiple calls to the service and the timeout will apply to each call individually. |
Returns
Type | Description |
---|---|
A streaming object (StorageStreamDownloader) |
Examples
Return the downloaded data.
download = file_client.download_file()
downloaded_bytes = download.readall()
exists
Returns True if a file exists and returns False otherwise.
exists(**kwargs: Any) -> bool
Keyword-Only Parameters
Name | Description |
---|---|
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
True if a file exists, otherwise returns False. |
flush_data
Commit the previous appended data.
flush_data(offset: int, retain_uncommitted_data: bool | None = False, **kwargs) -> Dict[str, str | datetime]
Parameters
Name | Description |
---|---|
offset
Required
|
offset is equal to the length of the file after commit the previous appended data. |
retain_uncommitted_data
Required
|
Valid only for flush operations. If "true", uncommitted data is retained after the flush operation completes; otherwise, the uncommitted data is deleted after the flush operation. The default is false. Data at offsets less than the specified position are written to the file when flush succeeds, but this optional parameter allows data after the flush position to be retained for a future flush operation. |
Keyword-Only Parameters
Name | Description |
---|---|
content_settings
|
ContentSettings object used to set path properties. |
close
|
Azure Storage Events allow applications to receive notifications when files change. When Azure Storage Events are enabled, a file changed event is raised. This event has a property indicating whether this is the final change to distinguish the difference between an intermediate flush to a file stream and the final close of a file stream. The close query parameter is valid only when the action is "flush" and change notifications are enabled. If the value of close is "true" and the flush operation completes successfully, the service raises a file change notification with a property indicating that this is the final update (the file stream has been closed). If "false" a change notification is raised indicating the file has changed. The default is false. This query parameter is set to true by the Hadoop ABFS driver to indicate that the file stream has been closed." |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
lease_action
|
Literal["acquire", "auto-renew", "release", "acquire-release"]
Used to perform lease operations along with appending data. "acquire" - Acquire a lease. "auto-renew" - Re-new an existing lease. "release" - Release the lease once the operation is complete. "acquire-release" - Acquire a lease and release it once the operations is complete. |
lease_duration
|
Valid if lease_action is set to "acquire" or "acquire-release". Specifies the duration of the lease, in seconds, or negative one (-1) for a lease that never expires. A non-infinite lease can be between 15 and 60 seconds. A lease duration cannot be changed using renew or change. Default is -1 (infinite lease). |
lease
|
Required if the file has an active lease or if lease_action is set to "acquire" or "acquire-release". If the file has an existing lease, this will be used to access the file. If acquiring a new lease, this will be used as the new lease id. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
cpk
|
Encrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. |
Returns
Type | Description |
---|---|
response header in dict |
Examples
Commit the previous appended data.
with open(SOURCE_FILE, "rb") as data:
file_client = file_system_client.get_file_client("myfile")
file_client.create_file()
file_client.append_data(data, 0)
file_client.flush_data(data.tell())
from_connection_string
Create DataLakeFileClient from a Connection String.
from_connection_string(conn_str: str, file_system_name: str, file_path: str, credential: str | Dict[str, str] | AzureNamedKeyCredential | AzureSasCredential | TokenCredential | None = None, **kwargs: Any) -> Self
Parameters
Name | Description |
---|---|
conn_str
Required
|
A connection string to an Azure Storage account. |
file_system_name
Required
|
The name of file system to interact with. |
file_path
Required
|
The whole file path, so that to interact with a specific file. eg. "{directory}/{subdirectory}/{file}" |
credential
|
The credentials with which to authenticate. This is optional if the account URL already has a SAS token, or the connection string already has shared access key values. The value can be a SAS token string, an instance of a AzureSasCredential or AzureNamedKeyCredential from azure.core.credentials, an account shared access key, or an instance of a TokenCredentials class from azure.identity. Credentials provided here will take precedence over those in the connection string. If using an instance of AzureNamedKeyCredential, "name" should be the storage account name, and "key" should be the storage account key. Default value: None
|
Keyword-Only Parameters
Name | Description |
---|---|
audience
|
The audience to use when requesting tokens for Azure Active Directory authentication. Only has an effect when credential is of type TokenCredential. The value could be https://storage.azure.com/ (default) or https://.blob.core.windows.net. |
Returns
Type | Description |
---|---|
A DataLakeFileClient. |
get_access_control
get_access_control(upn: bool | None = None, **kwargs) -> Dict[str, Any]
Parameters
Name | Description |
---|---|
upn
Required
|
Optional. Valid only when Hierarchical Namespace is enabled for the account. If "true", the user identity values returned in the x-ms-owner, x-ms-group, and x-ms-acl response headers will be transformed from Azure Active Directory Object IDs to User Principal Names. If "false", the values will be returned as Azure Active Directory Object IDs. The default value is false. Note that group and application Object IDs are not translated because they do not have unique friendly names. |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
Required if the file/directory has an active lease. Value can be a LeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
response dict containing access control options with no modifications. |
get_file_properties
Returns all user-defined metadata, standard HTTP properties, and system properties for the file. It does not return the content of the file.
get_file_properties(**kwargs: Any) -> FileProperties
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
Required if the directory or file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Decrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. Required if the file was created with a customer-provided key. |
upn
|
If True, the user identity values returned in the x-ms-owner, x-ms-group, and x-ms-acl response headers will be transformed from Azure Active Directory Object IDs to User Principal Names in the owner, group, and acl fields of FileProperties. If False, the values will be returned as Azure Active Directory Object IDs. The default value is False. Note that group and application Object IDs are not translate because they do not have unique friendly names. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
All user-defined metadata, standard HTTP properties, and system properties for the file. |
Examples
Getting the properties for a file.
properties = file_client.get_file_properties()
query_file
Enables users to select/project on datalake file data by providing simple query expressions. This operations returns a DataLakeFileQueryReader, users need to use readall() or readinto() to get query data.
query_file(query_expression: str, **kwargs: Any) -> DataLakeFileQueryReader
Parameters
Name | Description |
---|---|
query_expression
Required
|
Required. a query statement. eg. Select * from DataLakeStorage |
Keyword-Only Parameters
Name | Description |
---|---|
on_error
|
A function to be called on any processing errors returned by the service. |
file_format
|
Optional. Defines the serialization of the data currently stored in the file. The default is to treat the file data as CSV data formatted in the default dialect. This can be overridden with a custom DelimitedTextDialect, or DelimitedJsonDialect or "ParquetDialect" (passed as a string or enum). These dialects can be passed through their respective classes, the QuickQueryDialect enum or as a string. |
output_format
|
Optional. Defines the output serialization for the data stream. By default the data will be returned as it is represented in the file. By providing an output format, the file data will be reformatted according to that profile. This value can be a DelimitedTextDialect or a DelimitedJsonDialect or ArrowDialect. These dialects can be passed through their respective classes, the QuickQueryDialect enum or as a string. |
lease
|
Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Decrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. Required if the file was created with a Customer-Provided Key. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
<xref:azure.storage.filedatalake.DataLakeFileQueryReader>
|
A streaming object (DataLakeFileQueryReader) |
Examples
select/project on datalake file data by providing simple query expressions.
errors = []
def on_error(error):
errors.append(error)
# upload the csv file
file_client = datalake_service_client.get_file_client(filesystem_name, "csvfile")
file_client.upload_data(CSV_DATA, overwrite=True)
# select the second column of the csv file
query_expression = "SELECT _2 from DataLakeStorage"
input_format = DelimitedTextDialect(delimiter=',', quotechar='"', lineterminator='\n', escapechar="", has_header=False)
output_format = DelimitedJsonDialect(delimiter='\n')
reader = file_client.query_file(query_expression, on_error=on_error, file_format=input_format, output_format=output_format)
content = reader.readall()
remove_access_control_recursive
Removes the Access Control on a path and sub-paths.
remove_access_control_recursive(acl: str, **kwargs: Any) -> AccessControlChangeResult
Parameters
Name | Description |
---|---|
acl
Required
|
Removes POSIX access control rights on files and directories. The value is a comma-separated list of access control entries. Each access control entry (ACE) consists of a scope, a type, and a user or group identifier in the format "[scope:][type]:[id]". |
Keyword-Only Parameters
Name | Description |
---|---|
progress_hook
|
<xref:func>(AccessControlChanges)
Callback where the caller can track progress of the operation as well as collect paths that failed to change Access Control. |
continuation_token
|
Optional continuation token that can be used to resume previously stopped operation. |
batch_size
|
Optional. If data set size exceeds batch size then operation will be split into multiple requests so that progress can be tracked. Batch size should be between 1 and 2000. The default when unspecified is 2000. |
max_batches
|
Optional. Defines maximum number of batches that single change Access Control operation can execute. If maximum is reached before all sub-paths are processed then, continuation token can be used to resume operation. Empty value indicates that maximum number of batches in unbound and operation continues till end. |
continue_on_failure
|
If set to False, the operation will terminate quickly on encountering user errors (4XX). If True, the operation will ignore user errors and proceed with the operation on other sub-entities of the directory. Continuation token will only be returned when continue_on_failure is True in case of user errors. If not set the default value is False for this. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
A summary of the recursive operations, including the count of successes and failures, as well as a continuation token in case the operation was terminated prematurely. |
Exceptions
Type | Description |
---|---|
User can restart the operation using continuation_token field of AzureError if the token is available. |
rename_file
Rename the source file.
rename_file(new_name: str, **kwargs: Any) -> DataLakeFileClient
Parameters
Name | Description |
---|---|
new_name
Required
|
the new file name the user want to rename to. The value must have the following format: "{filesystem}/{directory}/{subdirectory}/{file}". |
Keyword-Only Parameters
Name | Description |
---|---|
content_settings
|
ContentSettings object used to set path properties. |
source_lease
|
A lease ID for the source path. If specified, the source path must have an active lease and the lease ID must match. |
lease
|
Required if the file/directory has an active lease. Value can be a LeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
source_if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
source_if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
source_etag
|
The source ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
source_match_condition
|
The source match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
the renamed file client |
Examples
Rename the source file.
new_client = file_client.rename_file(file_client.file_system_name + '/' + 'newname')
set_access_control
Set the owner, group, permissions, or access control list for a path.
set_access_control(owner: str | None = None, group: str | None = None, permissions: str | None = None, acl: str | None = None, **kwargs) -> Dict[str, str | datetime]
Parameters
Name | Description |
---|---|
owner
Required
|
Optional. The owner of the file or directory. |
group
Required
|
Optional. The owning group of the file or directory. |
permissions
Required
|
Optional and only valid if Hierarchical Namespace is enabled for the account. Sets POSIX access permissions for the file owner, the file owning group, and others. Each class may be granted read, write, or execute permission. The sticky bit is also supported. Both symbolic (rwxrw-rw-) and 4-digit octal notation (e.g. 0766) are supported. permissions and acl are mutually exclusive. |
acl
Required
|
Sets POSIX access control rights on files and directories. The value is a comma-separated list of access control entries. Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format "[scope:][type]:[id]:[permissions]". permissions and acl are mutually exclusive. |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
Required if the file/directory has an active lease. Value can be a LeaseClient object or the lease ID as a string. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
response dict containing access control options (Etag and last modified). |
set_access_control_recursive
Sets the Access Control on a path and sub-paths.
set_access_control_recursive(acl: str, **kwargs: Any) -> AccessControlChangeResult
Parameters
Name | Description |
---|---|
acl
Required
|
Sets POSIX access control rights on files and directories. The value is a comma-separated list of access control entries. Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format "[scope:][type]:[id]:[permissions]". |
Keyword-Only Parameters
Name | Description |
---|---|
progress_hook
|
<xref:func>(AccessControlChanges)
Callback where the caller can track progress of the operation as well as collect paths that failed to change Access Control. |
continuation_token
|
Optional continuation token that can be used to resume previously stopped operation. |
batch_size
|
Optional. If data set size exceeds batch size then operation will be split into multiple requests so that progress can be tracked. Batch size should be between 1 and 2000. The default when unspecified is 2000. |
max_batches
|
Optional. Defines maximum number of batches that single change Access Control operation can execute. If maximum is reached before all sub-paths are processed, then continuation token can be used to resume operation. Empty value indicates that maximum number of batches in unbound and operation continues till end. |
continue_on_failure
|
If set to False, the operation will terminate quickly on encountering user errors (4XX). If True, the operation will ignore user errors and proceed with the operation on other sub-entities of the directory. Continuation token will only be returned when continue_on_failure is True in case of user errors. If not set the default value is False for this. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
A summary of the recursive operations, including the count of successes and failures, as well as a continuation token in case the operation was terminated prematurely. |
Exceptions
Type | Description |
---|---|
User can restart the operation using continuation_token field of AzureError if the token is available. |
set_file_expiry
Sets the time a file will expire and be deleted.
set_file_expiry(expiry_options: str, expires_on: datetime | int | None = None, **kwargs) -> None
Parameters
Name | Description |
---|---|
expiry_options
Required
|
Required. Indicates mode of the expiry time. Possible values include: 'NeverExpire', 'RelativeToCreation', 'RelativeToNow', 'Absolute' |
expires_on
Required
|
The time to set the file to expiry. When expiry_options is RelativeTo*, expires_on should be an int in milliseconds. If the type of expires_on is datetime, it should be in UTC time. |
Keyword-Only Parameters
Name | Description |
---|---|
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
set_http_headers
Sets system properties on the file or directory.
If one property is set for the content_settings, all properties will be overridden.
set_http_headers(content_settings: ContentSettings | None = None, **kwargs) -> Dict[str, Any]
Parameters
Name | Description |
---|---|
content_settings
Required
|
ContentSettings object used to set file/directory properties. |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
If specified, set_file_system_metadata only succeeds if the file system's lease is active and matches this ID. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
file/directory-updated property dict (Etag and last modified) |
set_metadata
Sets one or more user-defined name-value pairs for the specified file system. Each call to this operation replaces all existing metadata attached to the file system. To remove all metadata from the file system, call this operation with no metadata dict.
set_metadata(metadata: Dict[str, str], **kwargs) -> Dict[str, str | datetime]
Parameters
Name | Description |
---|---|
metadata
Required
|
A dict containing name-value pairs to associate with the file system as metadata. Example: {'category':'test'} |
Keyword-Only Parameters
Name | Description |
---|---|
lease
|
If specified, set_file_system_metadata only succeeds if the file system's lease is active and matches this ID. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Encrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
file system-updated property dict (Etag and last modified). |
update_access_control_recursive
Modifies the Access Control on a path and sub-paths.
update_access_control_recursive(acl: str, **kwargs: Any) -> AccessControlChangeResult
Parameters
Name | Description |
---|---|
acl
Required
|
Modifies POSIX access control rights on files and directories. The value is a comma-separated list of access control entries. Each access control entry (ACE) consists of a scope, a type, a user or group identifier, and permissions in the format "[scope:][type]:[id]:[permissions]". |
Keyword-Only Parameters
Name | Description |
---|---|
progress_hook
|
<xref:func>(AccessControlChanges)
Callback where the caller can track progress of the operation as well as collect paths that failed to change Access Control. |
continuation_token
|
Optional continuation token that can be used to resume previously stopped operation. |
batch_size
|
Optional. If data set size exceeds batch size then operation will be split into multiple requests so that progress can be tracked. Batch size should be between 1 and 2000. The default when unspecified is 2000. |
max_batches
|
Optional. Defines maximum number of batches that single change Access Control operation can execute. If maximum is reached before all sub-paths are processed, then continuation token can be used to resume operation. Empty value indicates that maximum number of batches in unbound and operation continues till end. |
continue_on_failure
|
If set to False, the operation will terminate quickly on encountering user errors (4XX). If True, the operation will ignore user errors and proceed with the operation on other sub-entities of the directory. Continuation token will only be returned when continue_on_failure is True in case of user errors. If not set the default value is False for this. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. |
Returns
Type | Description |
---|---|
A summary of the recursive operations, including the count of successes and failures, as well as a continuation token in case the operation was terminated prematurely. |
Exceptions
Type | Description |
---|---|
User can restart the operation using continuation_token field of AzureError if the token is available. |
upload_data
Upload data to a file.
upload_data(data: bytes | str | Iterable | IO, length: int | None = None, overwrite: bool | None = False, **kwargs) -> Dict[str, Any]
Parameters
Name | Description |
---|---|
data
Required
|
Content to be uploaded to file |
length
Required
|
Size of the data in bytes. |
overwrite
Required
|
to overwrite an existing file or not. |
Keyword-Only Parameters
Name | Description |
---|---|
content_settings
|
ContentSettings object used to set path properties. |
metadata
|
Name-value pairs associated with the blob as metadata. |
lease
|
Required if the blob has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string. |
umask
|
Optional and only valid if Hierarchical Namespace is enabled for the account. When creating a file or directory and the parent folder does not have a default ACL, the umask restricts the permissions of the file or directory to be created. The resulting permission is given by p & ^u, where p is the permission and u is the umask. For example, if p is 0777 and u is 0057, then the resulting permission is 0720. The default permission is 0777 for a directory and 0666 for a file. The default umask is 0027. The umask must be specified in 4-digit octal notation (e.g. 0766). |
permissions
|
Optional and only valid if Hierarchical Namespace is enabled for the account. Sets POSIX access permissions for the file owner, the file owning group, and others. Each class may be granted read, write, or execute permission. The sticky bit is also supported. Both symbolic (rwxrw-rw-) and 4-digit octal notation (e.g. 0766) are supported. |
if_modified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time. |
if_unmodified_since
|
A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time. |
validate_content
|
If true, calculates an MD5 hash for each chunk of the file. The storage service checks the hash of the content that has arrived with the hash that was sent. This is primarily valuable for detecting bitflips on the wire if using http instead of https, as https (the default), will already validate. Note that this MD5 hash is not stored with the blob. Also note that if enabled, the memory-efficient upload algorithm will not be used because computing the MD5 hash requires buffering entire blocks, and doing so defeats the purpose of the memory-efficient algorithm. |
etag
|
An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter. |
match_condition
|
The match condition to use upon the etag. |
cpk
|
Encrypts the data on the service-side with the given key. Use of customer-provided keys must be done over HTTPS. |
timeout
|
Sets the server-side timeout for the operation in seconds. For more details see https://learn.microsoft.com/rest/api/storageservices/setting-timeouts-for-blob-service-operations. This value is not tracked or validated on the client. To configure client-side network timesouts see here. This method may make multiple calls to the service and the timeout will apply to each call individually. |
max_concurrency
|
Maximum number of parallel connections to use when transferring the file in chunks. This option does not affect the underlying connection pool, and may require a separate configuration of the connection pool. |
chunk_size
|
The maximum chunk size for uploading a file in chunks.
Defaults to |
encryption_context
|
Specifies the encryption context to set on the file. |
Returns
Type | Description |
---|---|
response dict (Etag and last modified). |
Attributes
api_version
location_mode
The location mode that the client is currently using.
By default this will be "primary". Options include "primary" and "secondary".
Returns
Type | Description |
---|---|
primary_endpoint
primary_hostname
secondary_endpoint
The full secondary endpoint URL if configured.
If not available a ValueError will be raised. To explicitly specify a secondary hostname, use the optional secondary_hostname keyword argument on instantiation.
Returns
Type | Description |
---|---|
Exceptions
Type | Description |
---|---|
secondary_hostname
The hostname of the secondary endpoint.
If not available this will be None. To explicitly specify a secondary hostname, use the optional secondary_hostname keyword argument on instantiation.
Returns
Type | Description |
---|---|
url
The full endpoint URL to this entity, including SAS token if used.
This could be either the primary endpoint, or the secondary endpoint depending on the current location_mode. :returns: The full endpoint URL to this entity, including SAS token if used. :rtype: str
Azure SDK for Python