Use a custom container to deploy a model to an online endpoint

Article
2025-03-31

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In Azure Machine Learning, you can use a custom container to deploy a model to an online endpoint. Custom container deployments can use web servers other than the default Python Flask server that Azure Machine Learning uses.

When you use a custom deployment, you can:

Use various tools and technologies, such as TensorFlow Serving (TF Serving), TorchServe, Triton Inference Server, the Plumber R package, and the Azure Machine Learning inference minimal image.
Still take advantage of the built-in monitoring, scaling, alerting, and authentication that Azure Machine Learning offers.

This article shows you how to use a TF Serving image to serve a TensorFlow model.

Prerequisites

An Azure Machine Learning workspace. For instructions for creating a workspace, see Create the workspace.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
- Azure CLI
- Python SDK
To install the Azure CLI and the ml extension, see Install and set up the CLI (v2).

The examples in this article assume that you use a Bash shell or a compatible shell. For example, you can use a shell on a Linux system or Windows Subsystem for Linux.
To install the Python SDK v2, use the following command:
```
pip install azure-ai-ml azure-identity
```
To update an existing installation of the SDK to the latest version, use the following command:
```
pip install --upgrade azure-ai-ml azure-identity
```
For more information, see Azure Machine Learning Package client library for Python.

An Azure resource group that contains your workspace and that you or your service principal have Contributor access to. If you use the steps in Create the workspace to configure your workspace, you meet this requirement.
Docker Engine, installed and running locally. This prerequisite is highly recommended. You need it to deploy a model locally, and it's helpful for debugging.

Deployment examples

The following table lists deployment examples that use custom containers and take advantage of various tools and technologies.

Example	Azure CLI script	Description
minimal/multimodel	deploy-custom-container-minimal-multimodel	Deploys multiple models to a single deployment by extending the Azure Machine Learning inference minimal image.
minimal/single-model	deploy-custom-container-minimal-single-model	Deploys a single model by extending the Azure Machine Learning inference minimal image.
mlflow/multideployment-scikit	deploy-custom-container-mlflow-multideployment-scikit	Deploys two MLFlow models with different Python requirements to two separate deployments behind a single endpoint. Uses the Azure Machine Learning inference minimal image.
r/multimodel-plumber	deploy-custom-container-r-multimodel-plumber	Deploys three regression models to one endpoint. Uses the Plumber R package.
tfserving/half-plus-two	deploy-custom-container-tfserving-half-plus-two	Deploys a Half Plus Two model by using a TF Serving custom container. Uses the standard model registration process.
tfserving/half-plus-two-integrated	deploy-custom-container-tfserving-half-plus-two-integrated	Deploys a Half Plus Two model by using a TF Serving custom container with the model integrated into the image.
torchserve/densenet	deploy-custom-container-torchserve-densenet	Deploys a single model by using a TorchServe custom container.
triton/single-model	deploy-custom-container-triton-single-model	Deploys a Triton model by using a custom container.

This article shows you how to use the tfserving/half-plus-two example.

Warning

Microsoft support teams might not be able to help troubleshoot problems caused by a custom image. If you encounter problems, you might be asked to use the default image or one of the images that Microsoft provides to see whether the problem is specific to your image.

Download the source code

The steps in this article use code samples from the azureml-examples repository. Use the following commands to clone the repository:

Azure CLI
Python SDK

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

In the examples repository, most Python samples are under the sdk/python folder. For this article, go to the cli folder instead. The folder structure under the cli folder is slightly different than the sdk/python structure in this case. Most steps in this article require the cli structure.

To follow along with the example steps, see a Jupyter notebook in the examples repository. But in the following sections of that notebook, the steps run from the azureml-examples/sdk/python folder instead of the cli folder:

1. Test locally
1. Test the endpoint with sample data

Initialize environment variables

To use a TensorFlow model, you need several environment variables. Run the following commands to define those variables:

BASE_PATH=endpoints/online/custom-container/tfserving/half-plus-two
AML_MODEL_NAME=tfserving-mounted
MODEL_NAME=half_plus_two
MODEL_BASE_PATH=/var/azureml-app/azureml-models/$AML_MODEL_NAME/1

Download a TensorFlow model

Download and unzip a model that divides an input value by two and adds two to the result:

wget https://aka.ms/half_plus_two-model -O $BASE_PATH/half_plus_two.tar.gz
tar -xvf $BASE_PATH/half_plus_two.tar.gz -C $BASE_PATH

Test a TF Serving image locally

Use Docker to run your image locally for testing:

docker run --rm -d -v $PWD/$BASE_PATH:$MODEL_BASE_PATH -p 8501:8501 \
 -e MODEL_BASE_PATH=$MODEL_BASE_PATH -e MODEL_NAME=$MODEL_NAME \
 --name="tfserving-test" docker.io/tensorflow/serving:latest
sleep 10

Send liveness and scoring requests to the image

Send a liveness request to check that the process inside the container is running. You should get a response status code of 200 OK.

curl -v http://localhost:8501/v1/models/$MODEL_NAME

Send a scoring request to check that you can get predictions about unlabeled data:

curl --header "Content-Type: application/json" \
  --request POST \
  --data @$BASE_PATH/sample_request.json \
  http://localhost:8501/v1/models/$MODEL_NAME:predict

Stop the image

When you finish testing locally, stop the image:

docker stop tfserving-test

Deploy your online endpoint to Azure

To deploy your online endpoint to Azure, take the steps in the following sections.

Azure CLI
Python SDK

Create YAML files for your endpoint and deployment

You can configure your cloud deployment by using YAML. For instance, to configure your endpoint, you can create a YAML file named tfserving-endpoint.yml that contains the following lines:

$schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json
name: tfserving-endpoint
auth_mode: aml_token

To configure your deployment, you can create a YAML file named tfserving-deployment.yml that contains the following lines:

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
  name: tfserving-mounted
  version: <model-version>
  path: ./half_plus_two
environment_variables:
  MODEL_BASE_PATH: /var/azureml-app/azureml-models/tfserving-mounted/<model-version>
  MODEL_NAME: half_plus_two
environment:
  #name: tfserving
  #version: 1
  image: docker.io/tensorflow/serving:latest
  inference_config:
    liveness_route:
      port: 8501
      path: /v1/models/half_plus_two
    readiness_route:
      port: 8501
      path: /v1/models/half_plus_two
    scoring_route:
      port: 8501
      path: /v1/models/half_plus_two:predict
instance_type: Standard_DS3_v2
instance_count: 1

Connect to your Azure Machine Learning workspace

To configure your Azure Machine Learning workspace, take the following steps:

Import the required libraries:

# Import the required libraries.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
   ManagedOnlineEndpoint,
   ManagedOnlineDeployment,
   Model,
   Environment,
   CodeConfiguration,
)
from azure.identity import DefaultAzureCredential

Configure workspace settings and get a handle to the workspace:

# Enter information about your Azure Machine Learning workspace.
subscription_id = "<subscription-ID>"
resource_group = "<resource-group-name>"
workspace = "<Azure-Machine-Learning-workspace-name>"

# Get a handle to the workspace.
ml_client = MLClient(
  DefaultAzureCredential(), subscription_id, resource_group, workspace
)

For more information, see Deploy and score a machine learning model by using an online endpoint.

Configure an online endpoint

Use the following code to configure an online endpoint. Keep the following points in mind:

The name of the endpoint must be unique in its Azure region. Also, an endpoint name must start with a letter and only consist of alphanumeric characters and hyphens. For more information about naming rules, see Azure Machine Learning online endpoints and batch endpoints.
For the auth_mode value, use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. A key doesn't expire, but a token does expire. For more information about authentication, see Authenticate clients for online endpoints.
The description and tags are optional.

# To create a unique endpoint name, use a time stamp of the current date and time.
import datetime

online_endpoint_name = "endpoint-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# Configure an online endpoint.
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="A sample online endpoint",
    auth_mode="key",
    tags={"env": "dev"},
)

Configure an online deployment

A deployment is a set of resources that are required for hosting the model that does the actual inferencing. You can use the ManagedOnlineDeployment class to configure a deployment for your endpoint. The constructor of that class uses the following parameters:

name: The name of the deployment.
endpoint_name: The name of the endpoint to create the deployment under.
model: The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.
environment: The environment to use for the deployment. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification.
environment_variables: Environment variables that are set during deployment.
- MODEL_BASE_PATH: The path to the parent folder that contains a folder for your model.
- MODEL_NAME: The name of your model.
instance_type: The virtual machine size to use for the deployment. For a list of supported sizes, see Managed online endpoints SKU list.
instance_count: The number of instances to use for the deployment.

Use the following code to configure a deployment for your endpoint:

# create a blue deployment
model = Model(name="tfserving-mounted", version="1", path="half_plus_two")

env = Environment(
    image="docker.io/tensorflow/serving:latest",
    inference_config={
        "liveness_route": {"port": 8501, "path": "/v1/models/half_plus_two"},
        "readiness_route": {"port": 8501, "path": "/v1/models/half_plus_two"},
        "scoring_route": {"port": 8501, "path": "/v1/models/half_plus_two:predict"},
    },
)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=env,
    environment_variables={
        "MODEL_BASE_PATH": "/var/azureml-app/azureml-models/tfserving-mounted/1",
        "MODEL_NAME": "half_plus_two",
    },
    instance_type="Standard_DS2_v2",
    instance_count=1,
)

The following sections discuss important concepts about the YAML and Python parameters.

Base image

In the environment section in YAML, or the Environment constructor in Python, you specify the base image as a parameter. This example uses docker.io/tensorflow/serving:latest as the image value.

If you inspect your container, you can see that this server uses ENTRYPOINT commands to start an entry point script. That script takes environment variables such as MODEL_BASE_PATH and MODEL_NAME, and it exposes ports such as 8501. These details all pertain to this server, and you can use this information to determine how to define your deployment. For example, if you set the MODEL_BASE_PATH and MODEL_NAME environment variables in your deployment definition, TF Serving uses those values to initiate the server. Likewise, if you set the port for each route to 8501 in the deployment definition, user requests to those routes are correctly routed to the TF Serving server.

This example is based on the TF Serving case. But you can use any container that stays up and responds to requests that go to liveness, readiness, and scoring routes. To see how to form a Dockerfile to create a container, you can refer to other examples. Some servers use CMD instructions instead of ENTRYPOINT instructions.

The inference_config parameter

In the environment section or the Environment class, inference_config is a parameter. It specifies the port and path for three types of routes: liveness, readiness, and scoring routes. The inference_config parameter is required if you want to run your own container with a managed online endpoint.

Readiness routes vs. liveness routes

Some API servers provide a way to check the status of the server. There are two types of routes that you can specify for checking the status:

Liveness routes: To check whether a server is running, you use a liveness route.
Readiness routes: To check whether a server is ready to do work, you use a readiness route.

In the context of machine learning inferencing, a server might respond with a status code of 200 OK to a liveness request before loading a model. The server might respond with a status code of 200 OK to a readiness request only after it loads the model into memory.

For more information about liveness and readiness probes, see Configure Liveness, Readiness and Startup Probes.

The API server that you choose determines the liveness and readiness routes. You identify that server in an earlier step when you test the container locally. In this article, the example deployment uses the same path for the liveness and readiness routes, because TF Serving only defines a liveness route. For other ways of defining the routes, see other examples.

Scoring routes

The API server that you use provides a way to receive the payload to work on. In the context of machine learning inferencing, a server receives the input data via a specific route. Identify that route for your API server when you test the container locally in an earlier step. Specify that route as the scoring route when you define the deployment to create.

The successful creation of the deployment also updates the scoring_uri parameter of the endpoint. You can verify this fact by running the following command: az ml online-endpoint show -n <endpoint-name> --query scoring_uri.

Locate the mounted model

When you deploy a model as an online endpoint, Azure Machine Learning mounts your model to your endpoint. When the model is mounted, you can deploy new versions of the model without having to create a new Docker image. By default, a model registered with the name my-model and version 1 is located on the following path inside your deployed container: /var/azureml-app/azureml-models/my-model/1.

For example, consider the following setup:

A directory structure on your local machine of /azureml-examples/cli/endpoints/online/custom-container
A model name of half_plus_two

Screenshot that shows a tree view of a local directory structure. The /azureml-examples/cli/endpoints/online/custom-container path is visible.

Azure CLI
Python SDK

Suppose your tfserving-deployment.yml file contains the following lines in its model section. In this section, the name value refers to the name that you use to register the model in Azure Machine Learning.

model:
    name: tfserving-mounted
    version: 1
    path: ./half_plus_two

Suppose you use the following code to create a Model class. In this code, the name value refers to the name that you use to register the model in Azure Machine Learning.

model = Model(name="tfserving-mounted", version="1", path="half_plus_two")

In this case, when you create a deployment, your model is located under the following folder: /var/azureml-app/azureml-models/tfserving-mounted/1.

Screenshot that shows a tree view of the deployment directory structure. The var/azureml-app/azureml-models/tfserving-mounted/1 path is visible.

You can optionally configure your model_mount_path value. By adjusting this setting, you can change the path where the model is mounted.

Important

The model_mount_path value must be a valid absolute path in Linux (the OS of the container image).

When you change the value of model_mount_path, you also need to update the MODEL_BASE_PATH environment variable. Set MODEL_BASE_PATH to the same value as model_mount_path to avoid a failed deployment due to an error about the base path not being found.

Azure CLI
Python SDK

For example, you can add the model_mount_path parameter to your tfserving-deployment.yml file. You can also update the MODEL_BASE_PATH value in that file:

name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
  name: tfserving-mounted
  version: 1
  path: ./half_plus_two
model_mount_path: /var/tfserving-model-mount
environment_variables:
  MODEL_BASE_PATH: /var/tfserving-model-mount
...

For example, you can add the model_mount_path parameter to your ManagedOnlineDeployment class. You can also update the MODEL_BASE_PATH value in that code:

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=env,
    model_mount_path="/var/tfserving-model-mount",
    environment_variables={
        "MODEL_BASE_PATH": "/var/tfserving-model-mount",
    ...
)

In your deployment, your model is then located at /var/tfserving-model-mount/tfserving-mounted/1. It's no longer under azureml-app/azureml-models, but under the mount path that you specify:

Screenshot that shows a tree view of the deployment directory structure. The /var/tfserving-model-mount/tfserving-mounted/1 path is visible.

After you construct your YAML file, use the following command to create your endpoint:

az ml online-endpoint create --name tfserving-endpoint -f endpoints/online/custom-container/tfserving/half-plus-two/tfserving-endpoint.yml

Use the following command to create your deployment. This step might run for a few minutes.

az ml online-deployment create --name tfserving-deployment -f endpoints/online/custom-container/tfserving/half-plus-two/tfserving-deployment.yml --all-traffic

Use the following code to create the endpoint in the workspace. This code uses the instance of MLClient that you created earlier. The begin_create_or_update method starts the endpoint creation. It then returns a confirmation response while the endpoint creation continues.

ml_client.begin_create_or_update(endpoint)

Create the deployment by running the following code:

ml_client.begin_create_or_update(blue_deployment)

Invoke the endpoint

When your deployment is complete, make a scoring request to the deployed endpoint.

Azure CLI
Python SDK

RESPONSE=$(az ml online-endpoint invoke -n $ENDPOINT_NAME --request-file $BASE_PATH/sample_request.json)

Use the instance of MLClient that you created earlier to get a handle to the endpoint. Then use the invoke method and the following parameters to invoke the endpoint:

endpoint_name: The name of the endpoint
request_file: The file that contains the request data
deployment_name: The name of the deployment to test in the endpoint

For the request data, you can use a sample JSON file from the example repository.

# Test the blue deployment by using some sample data.
response = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="blue",
    request_file="sample_request.json",
)

Delete the endpoint

If you no longer need your endpoint, run the following command to delete it:

Azure CLI
Python SDK

az ml online-endpoint delete --name tfserving-endpoint

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Share via