Most big data solutions consist of repeated data processing operations, encapsulated in workflows. A pipeline orchestrator helps automate these workflows. It can schedule jobs, run workflows, and coordinate dependencies among tasks.
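To make the idea of coordinating dependencies among tasks concrete, the following minimal sketch orders a small workflow by using Python's standard `graphlib` module. The task names (`ingest`, `transform`, `publish`) and the `run_workflow` helper are hypothetical and aren't part of any Azure service. A real orchestrator adds scheduling, retries, and monitoring on top of this kind of ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "ingest": set(),
    "transform": {"ingest"},
    "publish": {"transform"},
}


def run_workflow(workflow, actions):
    """Run each task only after all of its dependencies have finished.

    `workflow` maps a task name to the names of its prerequisite tasks.
    `actions` maps a task name to a zero-argument callable that does the work.
    Returns the task names in the order they ran.
    """
    completed = []
    # static_order() yields tasks so that prerequisites always come first.
    for task in TopologicalSorter(workflow).static_order():
        actions[task]()
        completed.append(task)
    return completed
```

For example, calling `run_workflow(WORKFLOW, actions)` with a dictionary of no-op callables runs `ingest`, then `transform`, then `publish`, because each task depends on the one before it.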
Options for data pipeline orchestration
In Azure, the following services and tools meet the core requirements for pipeline orchestration, control flow, and data movement:
- Azure Data Factory
- Apache Oozie on Azure HDInsight
- SQL Server Integration Services (SSIS)
- Fabric Data Factory
You can use these services and tools independently or combine them to create a hybrid solution. For example, the integration runtime (IR) in Data Factory V2 can natively run SSIS packages in a managed Azure compute environment. These services share some functionality, but they have a few key differences.
Key selection criteria
To narrow your options, consider the following factors:
- Determine whether you need big data capabilities to move and transform your data. Big data workloads typically involve multiple gigabytes (GB) to terabytes (TB) of data. If you require these capabilities, choose a service designed for big data.
- Identify whether you need a managed service that can operate at scale. If you do, choose a cloud-based service that doesn't depend on your local processing power.
- Check whether you have data sources located on-premises. If you do, choose a service that supports both cloud and on-premises data sources or destinations.
- Check whether you store source data in blob storage on a Hadoop Distributed File System (HDFS). If you do, choose a service that supports Hive queries.
- Determine whether you need advanced orchestration for complex extract, transform, and load (ETL) workflows across multiple data sources. If you do, choose Fabric Data Factory because it provides a set of connectors, pipeline orchestration, and integration with both on-premises and cloud environments. It's ideal for enterprise-scale data movement and transformation.
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Data Factory | SSIS | Oozie on HDInsight | Fabric Data Factory |
|---|---|---|---|---|
| Managed | Yes | No | Yes | Yes |
| Cloud-based | Yes | No (local) | Yes | Yes |
| Prerequisite | Azure subscription | SQL Server | Azure subscription, HDInsight cluster | Fabric-enabled workspace |
| Management tools | Azure portal, PowerShell, CLI, .NET SDK | SQL Server Management Studio (SSMS), PowerShell | Bash shell, Oozie REST API, Oozie web user interface (UI) | Copy job, mirroring, pipeline activities, Dataflow Gen2 |
| Pricing | Pay per usage | Licensing, extra features add cost | Included with HDInsight cluster | Included with Fabric capacity |
Pipeline capabilities
| Capability | Data Factory | SSIS | Oozie on HDInsight | Fabric Data Factory |
|---|---|---|---|---|
| Copy data | Yes | Yes | Yes | Yes |
| Custom transformations | Yes | Yes | Yes (MapReduce, Pig, and Hive jobs) | Yes |
| Azure Machine Learning scoring | Yes | Yes (with scripting) | No | Yes (via integration) |
| HDInsight on-demand | Yes | No | No | No |
| Azure Batch | Yes | No | No | Yes |
| Pig, Hive, and MapReduce | Yes | No | Yes | Yes |
| Apache Spark | Yes | No | No | Yes |
| Run SSIS packages | Yes | Yes | No | Yes |
| Control flow | Yes | Yes | Yes | Yes |
| Access on-premises data | Yes | Yes | No | Yes |
Scalability capabilities
| Capability | Data Factory | SSIS | Oozie on HDInsight | Fabric Data Factory |
|---|---|---|---|---|
| Scale up | Yes | No | No | Yes |
| Scale out | Yes | No | Yes (by adding worker nodes to cluster) | Yes |
| Optimized for big data | Yes | No | Yes | Yes |
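One way to apply these tables is to treat them as data and filter on the capabilities that you require. The following sketch transcribes a subset of the rows above as Boolean values and returns the services that satisfy every requirement. The dictionary keys (such as `on_premises_data`) and the `shortlist` function are hypothetical names chosen for illustration, and qualifiers such as "with scripting" are omitted, so treat the result as a starting point rather than a final recommendation.

```python
# Subset of the capability tables, transcribed as Boolean values.
# Key names are illustrative; they don't correspond to any Azure API.
CAPABILITIES = {
    "Data Factory": {
        "managed": True, "cloud_based": True, "on_premises_data": True,
        "hive": True, "spark": True, "run_ssis": True, "big_data": True,
    },
    "SSIS": {
        "managed": False, "cloud_based": False, "on_premises_data": True,
        "hive": False, "spark": False, "run_ssis": True, "big_data": False,
    },
    "Oozie on HDInsight": {
        "managed": True, "cloud_based": True, "on_premises_data": False,
        "hive": True, "spark": False, "run_ssis": False, "big_data": True,
    },
    "Fabric Data Factory": {
        "managed": True, "cloud_based": True, "on_premises_data": True,
        "hive": True, "spark": True, "run_ssis": True, "big_data": True,
    },
}


def shortlist(required):
    """Return the services that support every capability in `required`."""
    return [
        service
        for service, caps in CAPABILITIES.items()
        if all(caps[name] for name in required)
    ]
```

For example, `shortlist(["on_premises_data", "hive"])` returns Data Factory and Fabric Data Factory, which matches the criteria for on-premises sources combined with Hive queries.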
Alternative approach
In addition to traditional batch-based orchestration, your platform can process data in real time by using Fabric Real-Time Intelligence. This approach enables continuous streaming data ingestion, in-flight transformation, and event-driven workflows so that you can respond as data arrives. It supports high-value scenarios such as Internet of Things (IoT) telemetry processing, fraud detection, and operational monitoring.
Contributors
Microsoft maintains this article. The following contributors wrote this article.
Principal author:
- Zoiner Tejada | CEO and Architect
Next steps
- Pipelines and activities in Fabric Data Factory
- Provision the Azure-SSIS integration runtime in Data Factory
- Use Oozie to run a workflow on HDInsight
- Medallion architecture in Fabric Real-Time Intelligence