Hello Akshay Kudale!
Thank you for reaching out to the Microsoft Q&A platform. Happy to answer your question.
This "Network is unreachable" error during authentication to login.windows.net (Azure AD) from Databricks, especially when using a Service Principal, is a classic sign of intermittent network connectivity or DNS resolution issues from the Databricks cluster's perspective. Because it is intermittent, it points to temporary disruptions rather than a permanent block.
What is happening in the background: your Databricks cluster needs to reach Azure AD's authentication endpoint (login.windows.net on port 443) to obtain the access token for your Service Principal. The HTTPSConnectionPool: Max retries exceeded error, combined with Network is unreachable, means the cluster could not establish even a basic TCP connection to Azure AD, so the authentication process never started.
This could be due to:
• Transient Network Glitches: Temporary routing issues or packet loss within Azure's network or between the Databricks cluster and Azure AD.
• DNS Resolution Issues: Intermittent problems where the Databricks cluster struggles to resolve login.windows.net to its IP address.
• Outbound Firewall/NSG Rules: Less likely if it's intermittent (a permanent block would be consistent), but worth verifying if some network paths are sometimes blocked.
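To tell these causes apart, one option is to run a small probe from a notebook cell on the affected cluster when the failure occurs. The following is a minimal sketch using only the Python standard library; the function name probe_endpoint and its return labels are illustrative assumptions, not part of any Databricks or Azure API. It separates a name resolution failure from a failed TCP connection.

```python
import socket


def probe_endpoint(host, port=443, timeout=5.0):
    """Classify one connection attempt.

    Returns a (status, detail) tuple where status is one of
    'ok', 'dns_failure', or 'connect_failure'.
    probe_endpoint is a hypothetical helper written for this answer.
    """
    # Step 1: name resolution. A failure here points at DNS,
    # not at NSG rules or routing.
    try:
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return "dns_failure", str(exc)

    # Step 2: plain TCP connect to the first resolved address.
    # "Network is unreachable", timeouts, and refusals all surface as OSError.
    family, socktype, proto, _, sockaddr = addrinfo[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return "connect_failure", str(exc)

    return "ok", "connected to {}:{}".format(sockaddr[0], sockaddr[1])
```

Calling probe_endpoint("login.windows.net") a few times during a failure window should show whether the attempts fail at resolution or at the TCP connect, which narrows down which of the checks below to start with.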
How to Permanently Address This Issue
The solution involves ensuring robust and consistent network outbound connectivity from your Databricks cluster to Azure AD.
1. Verify Databricks VNet Injection and Network Configuration:
o If your Databricks workspace is deployed with VNet Injection (which is recommended for advanced networking control), the network configuration of the VNet is critical.
o Check NSG Rules: Ensure the Network Security Groups (NSGs) applied to the Databricks subnets (public and private) have outbound rules allowing traffic to Azure Active Directory. This typically means:
a) Destination: Service Tag AzureActiveDirectory
b) Destination Port: 443 (HTTPS)
c) Protocol: TCP
d) Action: Allow
e) Ensure the priority of this rule is higher (lower number) than any deny rules that might capture this traffic.
o User-Defined Routes (UDRs) / Firewall Appliances: If you're routing outbound traffic through a firewall appliance (e.g., Azure Firewall) or using custom UDRs, ensure that login.windows.net (or the AzureActiveDirectory service tag) is explicitly allowed egress through that appliance/route. Intermittent failures can happen if the appliance is under load or has transient issues.
o For more detail, see these references:
- Databricks documentation on Network Security Group rules: User-defined route settings for Azure Databricks (This is crucial for VNet-injected workspaces).
- Azure Service Tags overview (for AzureActiveDirectory): Azure Service Tags overview
2. DNS Resolution Check:
o Ensure your Databricks VNet (if injected) is configured with reliable DNS servers. If you're using custom DNS (e.g., Azure DNS private zones or your own DNS servers), verify their stability and ability to resolve public endpoints like login.windows.net consistently. Intermittent DNS resolution failures can lead to "Network unreachable" errors.
o Action: Test DNS resolution from within a Databricks notebook if possible (e.g., using a %sh dig login.windows.net command to see resolution times and success rates).
o For more detail, see this reference:
- Azure DNS overview (for VNet DNS settings): What is Azure DNS?
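If dig is not installed on the cluster image, the same check can be done in Python. This is a rough sketch, assuming a Python notebook cell; check_dns and its parameters are hypothetical names chosen for illustration. It resolves the host several times and records how long each lookup took and whether it succeeded.

```python
import socket
import time


def check_dns(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and record timing and outcome per attempt.

    check_dns is a hypothetical helper; `resolver` is injectable only so
    the function can be exercised without real network access.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            infos = resolver(host, 443)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            addresses = sorted({info[4][0] for info in infos})
            error = None
        except socket.gaierror as exc:
            addresses, error = [], str(exc)
        elapsed_ms = (time.monotonic() - start) * 1000
        results.append({
            "attempt": i + 1,
            "ok": error is None,
            "elapsed_ms": elapsed_ms,
            "addresses": addresses,
            "error": error,
        })
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

For example, running check_dns("login.windows.net", attempts=20) and looking for entries where ok is False, or where elapsed_ms is unusually high, gives a quick picture of how stable resolution is from the cluster.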
3. Implement Retry Logic in your PySpark Code (Workaround/Mitigation):
o While the above steps address the root cause, for highly critical workflows, implementing retry logic with exponential backoff around your update_audit_log function (or the token acquisition part) can make your workflow more resilient to transient network issues. This won't fix the underlying problem but makes your jobs more robust.
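As an illustration of that idea, here is a minimal sketch of retry with exponential backoff and jitter. The helper name retry_with_backoff and its defaults are assumptions made for this example, and it assumes the failure surfaces as an OSError subclass (which is the case for the connection errors raised by the requests library); adjust retry_on if your token library raises something else.

```python
import random
import time


def retry_with_backoff(func, *, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, retry_on=(OSError,), sleep=time.sleep):
    """Call func() and retry on transient errors with exponential backoff.

    retry_with_backoff is a hypothetical helper. The wait doubles after
    each failed attempt (1s, 2s, 4s, ...), is capped at max_delay, and gets
    up to 50% random jitter so parallel jobs do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the job.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

You could then wrap the existing call, for example retry_with_backoff(lambda: update_audit_log(...)), passing whatever arguments your function already takes. Keeping max_attempts small avoids hiding a persistent network misconfiguration behind long retry loops.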
4. Monitor Databricks Cluster Health and Logs:
o Regularly check Databricks cluster logs and metrics for any signs of network instability or resource contention that might lead to these intermittent issues.
Given that it's intermittent and specifically affecting login.windows.net, the most likely culprit is an NSG rule or UDR issue that isn't always active, or a transient DNS problem. Start by thoroughly reviewing your Databricks VNet's outbound NSG rules for the AzureActiveDirectory service tag.
Please also check out these similar issues: [Issue 1](https://learn.microsoft.com/en-us/answers/questions/1065925/error-httpsconnectionpool(host-login-microsoftonli) and Issue 2. In both cases, the fix was to ensure that all required endpoints are allowed through the firewall.
Please "Accept as Answer" and Upvote if the answer provided is useful, so that you can help others in the community looking for remediation for similar issues.