Q1 — Branching strategy
The proposed strategy is aligned with the recommended patterns, with a small adjustment:
- Recommended flow from the Databricks CI/CD guidance:
  - Develop locally or in the workspace and deploy to a Databricks development workspace to test changes.
  - Create a feature branch to version control updates and regularly sync local/workspace changes.
  - When testing is finished, merge the feature branch into main.
  - CI/CD automatically deploys main to a staging workspace and runs automated tests.
  - When staging checks pass, CI/CD deploys main to production.
- For Azure DevOps more generally, the guidance is also to keep a simple strategy: main as the continuous development branch, plus release branches as needed, with CI triggered on every check-in to main and release branches.
Given that, two viable options are:
- Simple (recommended for most teams):
  - Feature branches → merge into main.
  - CI on main deploys to Dev (and optionally to Staging → Prod via promotion).
  - Use environments/approvals to control promotion to Prod.
- Two long‑lived branches (what is proposed):
  - Feature branches from dev.
  - Merge to dev → deploy to Dev workspace.
  - Merge to main → deploy to Prod workspace.
Both are compatible with Databricks bundles and Azure DevOps. The documentation, however, describes feature branches merging into main and then using CI/CD to move from staging to production, so feature branches based off main with environment-based promotion is closer to the documented best practice.
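One way to picture the "environments/approvals" control in the simple option is an Azure Pipelines deployment job bound to an environment. The fragment below is a minimal sketch, not a complete pipeline: the environment name `databricks-prod` and the job name are hypothetical, and it assumes the Databricks CLI is already installed and authenticated on the agent. Approvals and checks are configured on the environment in the Azure DevOps UI, not in the YAML itself.

```yaml
# Sketch only: a production stage gated by an Azure DevOps environment.
# "databricks-prod" and "DeployBundle" are hypothetical names.
stages:
  - stage: Deploy_Prod
    # Run only for builds of main that passed the earlier stages.
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployBundle
        environment: databricks-prod   # approvals/checks attached here pause the job until granted
        pool:
          vmImage: ubuntu-latest
        strategy:
          runOnce:
            deploy:
              steps:
                # Deployment jobs do not check out the repository by default.
                - checkout: self
                # Assumes the Databricks CLI is installed and credentials are injected.
                - script: databricks bundle deploy -t prod
                  displayName: Deploy bundle to the prod target
```

With this shape, a merge to main can flow through Dev and Staging automatically while Prod waits for whoever is listed as an approver on the environment.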
Q2 — Auto-discovery of new notebooks
From the bundles best practices and library-dependency guidance:
- Bundles require resources (jobs, pipelines, libraries) to be defined in source and referenced in `databricks.yml`.
- The recommendation is to “reference the uploaded compiled library in `databricks.yml`” and to define resources declaratively.
Implication:
- Adding a new notebook under the `src` path does not automatically create or wire up a new job/pipeline in the bundle.
- Any new workflow (job, pipeline, etc.) must be explicitly declared in the bundle configuration so that `databricks bundle validate` and `databricks bundle deploy` know what to deploy.
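As an illustration of what "explicitly declared" means, the fragment below sketches a job definition for a newly added notebook. All names (`new_notebook_job`, `run_new_notebook`, `./src/new_notebook.py`) are hypothetical placeholders. Compute settings are omitted for brevity, which assumes the workspace can run the task on serverless compute; otherwise a cluster definition would need to be added to the task.

```yaml
# Sketch only: declaring a job for a new notebook in databricks.yml
# (or in a file pulled in through the bundle's include list).
resources:
  jobs:
    new_notebook_job:                 # hypothetical resource key
      name: new-notebook-job
      tasks:
        - task_key: run_new_notebook
          notebook_task:
            # Path is relative to the file that contains this definition.
            notebook_path: ./src/new_notebook.py
```

Until a block like this exists, the notebook file is still uploaded with the bundle's synced files, but no job runs it.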
Q3 — Azure DevOps YAML pipeline triggers
From the Azure DevOps + Databricks CI/CD guidance:
- Pipelines are defined in YAML (`azure-pipelines.yml`) and can be customized per branch using the Git branch selector.
- Best practice is not to do production work directly in `main`, and to use a dedicated branch (for example `release`) for production deployment.
- Builds are typically triggered with every check-in to the relevant branch, and release pipelines or stages handle promotion across environments (Dev → QA → UAT → Staging → Prod).
For multi-environment deployments with bundles, two common patterns are supported by the docs and DevOps guidance:
- Single multi-stage YAML pipeline
  - One `azure-pipelines.yml` with stages like `Build`, `Deploy_Dev`, `Deploy_Prod`.
  - Use `trigger` and/or `condition` on stages to run Dev on one branch and Prod on another, for example:
    - Trigger on both `dev` and `main`.
    - `Deploy_Dev` stage runs when `Build.SourceBranch` is `refs/heads/dev`.
    - `Deploy_Prod` stage runs when `Build.SourceBranch` is `refs/heads/main`.
  - This aligns with the “simple branching strategy” and “deploy multiple branches to different stages” guidance, where different branches feed different stages/environments.
- Separate YAML pipelines per environment
  - One YAML file bound to `dev` branch (deploys to Dev workspace).
  - Another YAML file bound to `main` (deploys to Prod workspace).
  - Azure DevOps supports customizing the build process per branch via the branch selector in the pipeline editor.
Both are supported. The documentation leans toward a single pipeline with multiple stages and branch-based routing to stages when managing multiple environments, because it keeps the flow centralized and easier to reason about.
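The single-pipeline pattern could look roughly like the sketch below. It follows the stage names and branch conditions described above; the job names are hypothetical, and CLI installation and credential injection are left out of each job to keep the routing logic visible (each job runs on a fresh agent, so a real pipeline would repeat those steps per job).

```yaml
# Sketch only: one azure-pipelines.yml routing branches to stages.
trigger:
  branches:
    include:
      - dev
      - main

stages:
  - stage: Build
    jobs:
      - job: Validate              # hypothetical job name
        pool:
          vmImage: ubuntu-latest
        steps:
          # Assumes the Databricks CLI is installed and authenticated.
          - script: databricks bundle validate -t dev
            displayName: Validate bundle

  - stage: Deploy_Dev
    dependsOn: Build
    # Runs only for builds of the dev branch.
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/dev'))
    jobs:
      - job: DeployDev
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: databricks bundle deploy -t dev
            displayName: Deploy to Dev workspace

  - stage: Deploy_Prod
    # Depends on Build directly so a skipped Deploy_Dev does not block it.
    dependsOn: Build
    # Runs only for builds of the main branch.
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: DeployProd
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: databricks bundle deploy -t prod
            displayName: Deploy to Prod workspace
```

Both deploy stages name `Build` in `dependsOn` because stages otherwise depend on the previous stage in the file, and `Deploy_Dev` is skipped on main.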
Q4 — DAB target configuration
The target configuration shown matches the documented pattern for bundles:

    targets:
      dev:
        mode: development
        workspace:
          host: https://<dev-workspace>.azuredatabricks.net
      prod:
        mode: production
        workspace:
          host: https://<prod-workspace>.azuredatabricks.net

- Bundles explicitly support multiple targets (for example `dev`, `prod`) that map to different workspaces and modes.
- The recommended workflow is to validate and deploy the bundle per target using `databricks bundle validate` and `databricks bundle deploy` for the appropriate target.
For authentication:
- The GitHub Actions example in the documentation uses a service principal and sets the `DATABRICKS_TOKEN` environment variable from a secret (`SP_TOKEN`).
- The same pattern applies to Azure DevOps:
  - Use a service principal associated with the Databricks workspace.
  - Generate a Databricks access token for that principal.
  - Store the token as a secret variable in Azure DevOps (variable group or pipeline variable marked secret).
  - In the pipeline, set `DATABRICKS_TOKEN` (and any other required unified auth environment variables) from that secret before running bundle commands.
Authentication details are typically not stored in databricks.yml targets; instead, targets define workspace hosts and modes, while credentials are injected at runtime via environment variables or service connections.
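A minimal sketch of the runtime injection in Azure Pipelines follows. The variable group name `databricks-dev` is hypothetical, and the secret is assumed to be stored under the name `SP_TOKEN`, mirroring the GitHub Actions example. Azure Pipelines does not expose secret variables to scripts automatically, so the step maps the secret into the environment explicitly. The install command is the script published in the `databricks/setup-cli` repository; pinning a CLI version may be preferable in practice.

```yaml
# Sketch only: steps for one job that validates and deploys the dev target.
variables:
  - group: databricks-dev          # hypothetical variable group containing the secret SP_TOKEN

steps:
  - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    displayName: Install Databricks CLI

  - script: |
      set -e                       # stop at the first failing command
      databricks bundle validate -t dev
      databricks bundle deploy -t dev
    displayName: Validate and deploy the dev target
    env:
      # Secrets must be mapped explicitly; the workspace host comes from the target in databricks.yml.
      DATABRICKS_TOKEN: $(SP_TOKEN)
```

Switching `-t dev` to `-t prod`, with a separate secret for the production workspace, gives the equivalent production step.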
Q5 — PR gate: deploy-before-merge vs merge-then-deploy
From the documented CI/CD flow for bundles:
- The recommended sequence is:
  - Develop and test in a development workspace.
  - Use feature branches and merge into main after testing.
  - CI/CD then deploys main to staging and, after tests, to production.
This maps well to:
- PR from feature → integration branch (dev or main).
- CI validation on PR (build, `databricks bundle validate`, possibly a test deployment to a Dev workspace).
- After PR approval and merge, the main pipeline deploys to the appropriate environment(s).
The proposed plan:
- PR `feature` → `dev`.
- CI pipeline deploys bundle to Dev workspace as a PR validation gate.
- If deployment succeeds, PR is approved and merged.
- Merge into `main` triggers deployment to Prod.
This is compatible with the documented guidance, as long as:
- The Dev deployment used as a gate is against a non-production workspace (which matches the “develop in dev workspace” recommendation).
- Production deployment is only triggered from a stable branch (for example `main` or `release`) after tests and checks pass.
The alternative, also supported by the docs, is to:
- Run validation (including `databricks bundle validate`) on PR without deploying, then deploy to Dev only after merge.
Both are valid; using a Dev deployment as a PR gate is a stricter form of validation and fits within the recommended CI/CD model for bundles.
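The two gate styles can be sketched as a single stage that runs only for pull request builds. The assumptions here are that the repository is hosted in Azure Repos, where PR builds are started by a build validation branch policy on the target branch rather than by a `pr:` block in the YAML, and that the CLI is already installed and authenticated. The stage and job names are hypothetical.

```yaml
# Sketch only: a PR gate stage for an Azure Repos pipeline.
stages:
  - stage: PR_Validate
    # Build.Reason is 'PullRequest' when the run was started by a PR policy.
    condition: eq(variables['Build.Reason'], 'PullRequest')
    jobs:
      - job: ValidateBundle
        pool:
          vmImage: ubuntu-latest
        steps:
          # Lighter gate: configuration check only, nothing is deployed.
          - script: databricks bundle validate -t dev
            displayName: Validate bundle
          # Stricter gate: also deploy to the Dev workspace.
          # Remove this step to follow the validate-then-merge-then-deploy alternative.
          - script: databricks bundle deploy -t dev
            displayName: Test deployment to Dev workspace
```

In a PR build, `Build.SourceBranch` has the form `refs/pull/<id>/merge`, so stage conditions that compare it to `refs/heads/dev` or `refs/heads/main` do not match, and branch-based deploy stages stay out of PR runs.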
References:
- Best practices and recommended CI/CD workflows on Databricks
- CI/CD with Databricks Git folders
- Continuous integration and delivery on Azure Databricks using Azure DevOps
- GitHub Actions
- Choose a branching strategy with a DevOps mindset
- Deploy multiple branches to different stages with Classic release pipelines