Edit

Share via


Choose a data pipeline orchestration technology in Azure

Most big data solutions consist of repeated data processing operations, encapsulated in workflows. A pipeline orchestrator helps automate these workflows. It can schedule jobs, run workflows, and coordinate dependencies among tasks.

Options for data pipeline orchestration

In Azure, the following services and tools meet the core requirements for pipeline orchestration, control flow, and data movement:

You can use these services and tools independently or combine them to create a hybrid solution. For example, the integration runtime (IR) in Data Factory V2 can natively run SSIS packages in a managed Azure compute environment. These services share some functionality, but they have a few key differences.

Key selection criteria

To narrow your options, consider the following factors:

  • Determine whether you need big data capabilities to move and transform your data. These capabilities typically use multiple gigabytes (GBs) to terabytes (TBs) of data. If you require these capabilities, choose a service designed for big data.

  • Identify whether you need a managed service that can operate at scale. If you do, choose a cloud-based service that doesn't depend on your local processing power.

  • Check whether you have data sources located on-premises. If you do, choose a service that supports both cloud and on-premises data sources or destinations.

  • Check whether you store source data in blob storage on a Hadoop Distributed File System (HDFS). If you do, choose a service that supports Hive queries.

  • Determine whether you need advanced orchestration for complex extract, transform, and load (ETL) workflows across multiple data sources. If you do, choose Fabric Data Factory because it provides a set of connectors, pipeline orchestration, and integration with both on-premises and cloud environments. It's ideal for enterprise-scale data movement and transformation.

Capability matrix

The following tables summarize the key differences in capabilities.

General capabilities

Capability Data Factory SSIS Oozie on HDInsight Fabric Data Factory
Managed Yes No Yes Yes
Cloud-based Yes No (local) Yes Yes
Prerequisite Azure subscription SQL Server Azure subscription, HDInsight cluster Fabric-enabled workspace
Management tools Azure portal, PowerShell, CLI, .NET SDK SQL Server Management Studio (SSMS), PowerShell Bash shell, Oozie REST API, Oozie web user interface (UI) Copy job, mirroring, pipeline activities, Dataflow Gen2
Pricing Pay per usage Licensing, extra features add cost Included with HDInsight cluster Included with Fabric capacity

Pipeline capabilities

Capability Data Factory SSIS Oozie on HDInsight Fabric Data Factory
Copy data Yes Yes Yes Yes
Custom transformations Yes Yes Yes (MapReduce, Pig, and Hive jobs) Yes
Azure Machine Learning scoring Yes Yes (with scripting) No Yes (via integration)
HDInsight on-demand Yes No No No
Azure Batch Yes No No Yes
Pig, Hive, and MapReduce Yes No Yes Yes
Apache Spark Yes No No Yes
Run SSIS packages Yes Yes No Yes
Control flow Yes Yes Yes Yes
Access on-premises data Yes Yes No Yes

Scalability capabilities

Capability Data Factory SSIS Oozie on HDInsight Fabric Data Factory
Scale up Yes No No Yes
Scale out Yes No Yes (by adding worker nodes to cluster) Yes
Optimized for big data Yes No Yes Yes

Alternative approach

In addition to traditional batch-based orchestration, your platform can also use real-time intelligence through the Fabric Real-Time Intelligence feature. This approach enables continuous streaming data ingestion, in-flight transformation, and event-driven workflows so that you can respond instantly as data arrives. It supports high-value scenarios such as Internet of Things (IoT) telemetry processing, fraud detection, and operational monitoring.

Contributors

Microsoft maintains this article. The following contributors wrote this article.

Principal author:

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps