Quickstart: Create Apache Spark cluster in Azure HDInsight using Azure CLI
In this quickstart, you learn how to create an Apache Spark cluster in Azure HDInsight using the Azure CLI. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight enables fast data analytics and cluster computing using in-memory processing. The Azure CLI is Microsoft's cross-platform command-line experience for managing Azure resources.
If you're using multiple clusters together, you may want to create a virtual network; if you're using a Spark cluster, you may also want to use the Hive Warehouse Connector. For more information, see Plan a virtual network for Azure HDInsight and Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector.
If you don't have an Azure subscription, create an Azure free account before you begin.
Prerequisites
Use the Bash environment in Azure Cloud Shell. For more information, see Quickstart for Bash in Azure Cloud Shell.
If you prefer to run CLI reference commands locally, install the Azure CLI. If you're running on Windows or macOS, consider running Azure CLI in a Docker container. For more information, see How to run the Azure CLI in a Docker container.
If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish the authentication process, follow the steps displayed in your terminal. For other sign-in options, see Sign in with the Azure CLI.
When you're prompted, install the Azure CLI extension on first use. For more information about extensions, see Use extensions with the Azure CLI.
Run az version to find the version and dependent libraries that are installed. To upgrade to the latest version, run az upgrade.
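For reference, the version check and upgrade commands look like this:

# Show the CLI version and installed dependencies
az version

# Upgrade the CLI and its extensions to the latest release
az upgrade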
Create an Apache Spark cluster
Sign in to your Azure subscription. If you plan to use Azure Cloud Shell, select Try it in the upper-right corner of the following code block. Otherwise, enter the following command:
az login

# If you have multiple subscriptions, set the one to use
# az account set --subscription "SUBSCRIPTIONID"
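If you want to confirm which subscription the CLI will use before continuing, you can query it. This check isn't part of the original steps:

# Optional: show the name of the active subscription (not part of the original steps)
az account show --query name -o tsv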
Set environment variables. The use of variables in this quickstart is based on Bash. Slight variations are needed for other environments. Replace RESOURCEGROUPNAME, LOCATION, CLUSTERNAME, STORAGEACCOUNTNAME, and PASSWORD in the following code snippet with the desired values. Then enter the CLI commands to set the environment variables.
export resourceGroupName=RESOURCEGROUPNAME
export location=LOCATION
export clusterName=CLUSTERNAME
export AZURE_STORAGE_ACCOUNT=STORAGEACCOUNTNAME
export httpCredential='PASSWORD'
export sshCredentials='PASSWORD'
export AZURE_STORAGE_CONTAINER=$clusterName
export clusterSizeInNodes=1
export clusterVersion=4.0
export clusterType=spark
export componentVersion=Spark=2.3
Create the resource group by entering the following command:
az group create \
    --location $location \
    --name $resourceGroupName
Create an Azure storage account by entering the following command:
az storage account create \
    --name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName \
    --https-only true \
    --kind StorageV2 \
    --location $location \
    --sku Standard_LRS
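As an optional check that isn't part of the original quickstart, you can confirm the account finished provisioning (the expected value is Succeeded):

# Optional: confirm the storage account provisioned successfully (not in the original steps)
az storage account show \
    --name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName \
    --query provisioningState -o tsv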
Extract the primary key from the Azure storage account and store it in a variable by entering the following command:
export AZURE_STORAGE_KEY=$(az storage account keys list \
    --account-name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName \
    --query [0].value -o tsv)
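If the command succeeds, the variable holds the account key as a single string. As an optional sanity check that isn't part of the original steps, you can confirm it isn't empty before continuing:

# Optional: verify the key was retrieved (not in the original steps)
if [ -n "$AZURE_STORAGE_KEY" ]; then
    echo "Storage key retrieved."
else
    echo "Storage key is empty; check the account name and resource group."
fi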
Create an Azure storage container by entering the following command:
az storage container create \
    --name $AZURE_STORAGE_CONTAINER \
    --account-key $AZURE_STORAGE_KEY \
    --account-name $AZURE_STORAGE_ACCOUNT
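Optionally, and outside the original steps, you can confirm the container now exists:

# Optional: check that the container was created (not in the original steps)
az storage container exists \
    --name $AZURE_STORAGE_CONTAINER \
    --account-name $AZURE_STORAGE_ACCOUNT \
    --account-key $AZURE_STORAGE_KEY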
Create the Apache Spark cluster by entering the following command:
az hdinsight create \
    --name $clusterName \
    --resource-group $resourceGroupName \
    --type $clusterType \
    --component-version $componentVersion \
    --http-password $httpCredential \
    --http-user admin \
    --location $location \
    --workernode-count $clusterSizeInNodes \
    --ssh-password $sshCredentials \
    --ssh-user sshuser \
    --storage-account $AZURE_STORAGE_ACCOUNT \
    --storage-account-key $AZURE_STORAGE_KEY \
    --storage-container $AZURE_STORAGE_CONTAINER \
    --version $clusterVersion
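Cluster creation can take a while to complete. Once the command returns, you can check the cluster's state with az hdinsight show; this sketch assumes the clusterState property is exposed in the command's output and should read Running when the cluster is ready:

# Check the cluster's state (assumes the clusterState property in az hdinsight show output)
az hdinsight show \
    --name $clusterName \
    --resource-group $resourceGroupName \
    --query properties.clusterState -o tsv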
Clean up resources
After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it isn't in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.
Enter all or some of the following commands to remove resources:
# Remove cluster
az hdinsight delete \
--name $clusterName \
--resource-group $resourceGroupName
# Remove storage container
az storage container delete \
--account-name $AZURE_STORAGE_ACCOUNT \
--name $AZURE_STORAGE_CONTAINER
# Remove storage account
az storage account delete \
--name $AZURE_STORAGE_ACCOUNT \
--resource-group $resourceGroupName
# Remove resource group
az group delete \
--name $resourceGroupName
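The az group delete command prompts for confirmation and waits for the deletion to finish. If you prefer to skip the prompt and return immediately, the --yes and --no-wait flags are available:

# Delete the resource group without a confirmation prompt and without waiting for completion
az group delete --name $resourceGroupName --yes --no-wait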
Next steps
In this quickstart, you learned how to create an Apache Spark cluster in Azure HDInsight using the Azure CLI. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data.