Migrate on-premises Apache Hadoop clusters to Azure HDInsight - motivation and benefits

This article is the first in a series on best-practices for migrating on-premises Apache Hadoop eco-system deployments to Azure HDInsight. This series of articles is for people who are responsible for the design, deployment, and migration of Apache Hadoop solutions in Azure HDInsight. The roles that may benefit from these articles include cloud architects, Hadoop administrators, and DevOps engineers. Software developers, data engineers, and data scientists should also benefit from the explanation of how different types of clusters work in the cloud.

Why to migrate to Azure HDInsight

Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. HDInsight includes the most popular open-source frameworks such as:

  • Apache Hadoop
  • Apache Spark
  • Apache Hive with LLAP
  • Apache Kafka
  • Apache HBase

Azure HDInsight advantages over on-premises Hadoop

  • Low cost - Costs can be reduced by creating clusters on demand and paying only for what you use. Decoupled compute and storage provides flexibility by keeping the data volume independent of the cluster size.

  • Automated cluster creation - Automated cluster creation requires minimal setup and configuration. Automation can be used for on-demand clusters.

  • Managed hardware and configuration - There's no need to worry about the physical hardware or infrastructure with an HDInsight cluster. Just specify the configuration of the cluster, and Azure sets it up.

  • Easily scalable - HDInsight enables you to scale workloads up or down. Azure takes care of data redistribution and workload rebalancing without interrupting data processing jobs.

  • Global availability - HDInsight is available in more regions than any other big data analytics offering. Azure HDInsight is also available in Azure Government, China, and Germany, which allows you to meet your enterprise needs in key sovereign areas.

  • Secure and compliant - HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Microsoft Entra ID. HDInsight also meets the most popular industry and government compliance standards.

  • Simplified version management - Azure HDInsight manages the version of Hadoop eco-system components and keeps them up to date. Software updates are usually a complex process for on-premises deployments.

  • Smaller clusters optimized for specific workloads with fewer dependencies between components - A typical on-premises Hadoop setup uses a single cluster that serves many purposes. With Azure HDInsight, workload-specific clusters can be created. Creating clusters for specific workloads removes the complexity of maintaining a single cluster with growing complexity.

  • Productivity - You can use various tools for Hadoop and Spark in your preferred development environment.

  • Extensibility with custom tools or third-party applications - HDInsight clusters can be extended with installed components and can also be integrated with the other big data solutions by using one-click deployments from the Azure Market place.

  • Easy management, administration, and monitoring - Azure HDInsight integrates with Azure Monitor logs to provide a single interface with which you can monitor all your clusters.

  • Integration with other Azure services - HDInsight can easily be integrated with other popular Azure services such as the following:

    • Фабрика данных Azure (ADF)
    • Azure Blob-хранилище
    • Azure Data Lake Storage 2-го поколения
    • Azure Cosmos DB (облачная база данных)
    • База данных SQL Azure
    • Azure Analysis Services
  • Self-healing processes and components - HDInsight constantly checks the infrastructure and open-source components using its own monitoring infrastructure. It also automatically recovers critical failures such as unavailability of open-source components and nodes. Alerts are triggered in Ambari if any OSS component is failed.

For more information, see the article What is Azure HDInsight and the Apache Hadoop technology stack.

Migration planning process

The following steps are recommended for planning a migration of on-premises Hadoop clusters to Azure HDInsight:

  1. Understand the current on-premises deployment and topologies.
  2. Understand the current project scope, timelines, and team expertise.
  3. Understand the Azure requirements.
  4. Build out a detailed plan based on best practices.

Gathering details to prepare for a migration

This section provides template questionnaires to help gather important information about:

  • The on-premises deployment
  • Сведения о проекте
  • Требования Azure

On-premises deployment questionnaire

Вопрос Пример Ответ
Topic: Environment
Cluster Distribution version HDP 2.6.5, CDH 5.7
Big Data eco-system components HDFS, Yarn, Hive, LLAP, Impala, Kudu, HBase, Spark, MapReduce, Kafka, Zookeeper, Solr, Sqoop, Oozie, Ranger, Atlas, Falcon, Zeppelin, R
Типы кластера Hadoop, Spark, Confluent Kafka, Solr
Number of clusters 4
Number of master nodes 2
Количество рабочих узлов 100
Number of edge nodes 5
Total Disk space 100 ТБ
Master Node configuration m/y, cpu, disk, etc.
Data Nodes configuration m/y, cpu, disk, etc.
Edge Nodes configuration m/y, cpu, disk, etc.
HDFS Encryption? Да
Высокий уровень доступности HDFS HA, Metastore HA
Disaster Recovery / Back up Backup cluster?
Systems that are dependent on Cluster SQL Server, Teradata, Power BI, MongoDB
Интеграция сторонних продуктов Tableau, GridGain, Qubole, Informatica, Splunk
Topic: Security
Безопасность периметра Брандмауэры
Cluster authentication & authorization Active Directory, Ambari, Cloudera Manager, No authentication
HDFS Access Control Manual, ssh users
Hive authentication & authorization Sentry, LDAP, AD with Kerberos, Ranger
Аудит Ambari, Cloudera Navigator, Ranger
Контроль Graphite, collectd, statsd, Telegraf, InfluxDB
Оповещение Kapacitor, Prometheus, Datadog
Data Retention duration Three years, five years
Cluster Administrators Single Administrator, Multiple Administrators

Project details questionnaire

Вопрос Пример Ответ
Topic: Workloads and Frequency
MapReduce jobs 10 jobs--twice daily
Hive jobs 100 jobs--every hour
Spark batch jobs 50 jobs--every 15 minutes
Spark Streaming jobs 5 jobs--every 3 minutes
Structured Streaming jobs 5 jobs--every minute
Языки программирования Python, Scala, Java
Скриптинг Shell, Python
Topic: Data
Источники данных Flat files, JSON, Kafka, RDBMS
Оркестрация данных Oozie workflows, Airflow
In memory lookups Apache Ignite, Redis
Data destinations HDFS, RDBMS, Kafka, MPP
Topic: Meta data
Hive DB type Mysql, Postgres
Number of Hive metastores 2
Number of Hive tables 100
Number of Ranger policies 20
Number of Oozie workflows 100
Topic: Scale
Data volume including Replication 100 ТБ
Daily ingestion volume 50 ГБ
Data growth rate 10% per year
Cluster Nodes growth rate 5% per year
Topic: Cluster utilization
Average CPU % used 60 %
Average Memory % used 75%
Disk space used 75%
Average Network % used 25%
Topic: Staff
Number of Administrators 2
Number of Developers 10
Number of end users 100
Навыки Hadoop, Spark
Number of available resources for Migration efforts 2
Topic: Limitations
Текущие ограничения Latency is high
Current challenges Concurrency issue

Azure requirements questionnaire

Вопрос Пример Ответ
Topic: Infrastructure
Preferred Region Восточная часть США
VNet preferred? Да
HA / DR Needed? Да
Integration with other cloud services? ADF, Azure Cosmos DB
Topic: Data Movement
Initial load preference DistCp, Data box, ADF, WANDisco
Data transfer delta DistCp, AzCopy
Ongoing incremental data transfer DistCp, Sqoop
Topic: Monitoring & Alerting
Use Azure Monitoring & Alerting vs Integrate third-party monitoring Use Azure Monitoring & Alerting
Topic: Security preferences
Private and protected data pipeline? Да
Domain Joined cluster (ESP)? Да
On-Premises AD Sync to Cloud? Да
Number of AD users to sync? 100
Ok to sync passwords to cloud? Да
Cloud only Users? Да
MFA needed? нет
Data authorization requirements? Да
Role-based access control? Да
Auditing needed? Да
Data encryption at rest? Да
Data encryption in transit? Да
Topic: Re-Architecture preferences
Single cluster vs Specific cluster types Specific cluster types
Colocated Storage Vs Remote Storage? Удаленное хранилище
Smaller cluster size as data is stored remotely? Smaller cluster size
Use multiple smaller clusters rather than a single large cluster? Use multiple smaller clusters
Use a remote metastore? Да
Share metastores between different clusters? Да
Deconstruct workloads? Replace Hive jobs with Spark jobs
Use ADF for data orchestration? нет

Дальнейшие действия

Прочитайте следующую статью в этом цикле: