Review App

The Review App is a web-based UI designed for collecting structured feedback from domain experts without requiring them to write code. Use it to gather insights that improve your GenAI app's quality and align LLM judges with business requirements.


Two ways to use the Review App

Label existing traces

Ask experts to review existing interactions with your app to provide feedback and expectations.

Use this to:

  • Understand what high-quality, correct responses look like for specific queries
  • Collect input to align LLM judges with your business requirements
  • Create evaluation datasets from production traces

Vibe check a pre-production app

To use vibe check mode, you must have your application deployed.

Ask experts to chat with a deployed app and provide feedback on the app's responses in real-time.

Use this to:

  • Get quick feedback on new app versions before deployment
  • Test app behavior without impacting your production environment
  • Validate improvements with domain experts

Mode comparison

| Aspect | Label existing traces | Vibe check mode |
| --- | --- | --- |
| Input source | Existing traces | Domain expert enters queries |
| Output source | Existing traces | Live agent endpoint responses |
| Custom labeling schema | ✅ Yes - define custom questions and criteria | ❌ No - uses fixed feedback questions |
| Results stored in | MLflow Traces (inside a Labeling Session) | MLflow Traces |

Prerequisites

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
    
  2. Create an MLflow experiment by following the set up your environment quickstart (a minimal setup sketch follows this list).

  3. For vibe check mode only: a deployed agent endpoint using Agent Framework
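
If you prefer to complete step 2 in code, a minimal sketch is shown below; the experiment path is a placeholder, and in a Databricks notebook the tracking URI is already configured for you.

    import mlflow

    # Point MLflow at your Databricks workspace when running outside a notebook
    mlflow.set_tracking_uri("databricks")

    # Create the experiment if it doesn't exist, and make it the active experiment
    # (the path below is a placeholder; use any workspace path you own)
    mlflow.set_experiment("/Users/<your-username>/review-app-demo")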

1. Labeling existing traces

Labeling existing traces allows you to collect structured feedback on traces you've already captured from production or development. This is ideal for building evaluation datasets, understanding quality patterns, and training custom LLM judges.

The process involves creating a labeling session, defining what feedback to collect, adding traces to review, and sharing the session with domain experts. For complete step-by-step instructions, see Label existing traces.

For detailed information about labeling sessions, schemas, and configuration options, see Labeling Sessions and Labeling Schemas.
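
For illustration, the sketch below strings these steps together with the MLflow GenAI labeling APIs. The schema name, question, reviewer email, and trace query are placeholders; see Labeling Sessions and Labeling Schemas for the exact arguments supported by your MLflow version.

    import mlflow
    from mlflow.genai import label_schemas, labeling

    # 1. Define what feedback to collect (placeholder question)
    quality = label_schemas.create_label_schema(
        name="response_quality",
        type=label_schemas.LabelSchemaType.FEEDBACK,
        title="Is this response accurate and helpful?",
        input=label_schemas.InputCategorical(options=["Yes", "No"]),
        enable_comment=True,
    )

    # 2. Create a labeling session that uses the schema and assign a reviewer
    session = labeling.create_labeling_session(
        name="expert_review",
        label_schemas=[quality.name],
        assigned_users=["expert@company.com"],  # placeholder reviewer
    )

    # 3. Add existing traces from the current experiment to the session
    traces = mlflow.search_traces(max_results=20)
    session.add_traces(traces)

    # 4. Share the Review App link with your domain experts
    print(f"Share this URL: {session.url}")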

2. Vibe check mode

  1. Package your app using Agent Framework and deploy it as a Model Serving endpoint.

  2. Add the endpoint to your experiment's review app:

    Note

    The example below adds a Databricks-hosted LLM to the review app. Replace the endpoint with your app's endpoint from step 1.

    from mlflow.genai.labeling import get_review_app
    
    # Get review app for current MLflow experiment
    review_app = get_review_app()
    
    # Connect your deployed agent endpoint
    review_app.add_agent(
        agent_name="claude-sonnet",
        model_serving_endpoint="databricks-claude-3-7-sonnet",
    )
    
    print(f"Share this URL: {review_app.url}/chat")
    

Domain experts can now chat with your app and provide immediate feedback.

Permissions model

For labeling existing traces

Domain experts need:

  • Account access: Must be provisioned in your Databricks account, but do not need access to your workspace
  • Experiment access: WRITE permission to the MLflow experiment
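
If you manage access in code rather than the UI, the sketch below grants write-level access with the Databricks SDK; the experiment path and user email are placeholders, and CAN_EDIT is used here as the write permission level.

    import mlflow
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.ml import (
        ExperimentAccessControlRequest,
        ExperimentPermissionLevel,
    )

    w = WorkspaceClient()

    # Look up the Review App's experiment (path is a placeholder)
    experiment = mlflow.get_experiment_by_name("/Users/<your-username>/review-app-demo")

    # Grant a domain expert write-level access to the experiment
    w.experiments.update_permissions(
        experiment_id=experiment.experiment_id,
        access_control_list=[
            ExperimentAccessControlRequest(
                user_name="expert@company.com",
                permission_level=ExperimentPermissionLevel.CAN_EDIT,
            )
        ],
    )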

For vibe check mode

Domain experts need:

  • Account access: Must be provisioned in your Databricks account, but do not need access to your workspace
  • Endpoint access: CAN_QUERY permission to the model serving endpoint
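
Similarly, a hedged sketch for granting CAN_QUERY on the serving endpoint with the Databricks SDK; the endpoint name and user email are placeholders.

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import (
        ServingEndpointAccessControlRequest,
        ServingEndpointPermissionLevel,
    )

    w = WorkspaceClient()

    # Look up the serving endpoint backing your agent (name is a placeholder)
    endpoint = w.serving_endpoints.get("my-agent-endpoint")

    # Grant a domain expert query access to the endpoint
    w.serving_endpoints.update_permissions(
        serving_endpoint_id=endpoint.id,
        access_control_list=[
            ServingEndpointAccessControlRequest(
                user_name="expert@company.com",
                permission_level=ServingEndpointPermissionLevel.CAN_QUERY,
            )
        ],
    )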

Setting up account access

For users without workspace access, account admins can:

  • Use account-level SCIM provisioning to sync users from your identity provider
  • Manually register users and groups in Databricks

See User and group management for details.

Content rendering

The Review App automatically renders different content types from your MLflow Trace:

  • Retrieved documents: Documents within a RETRIEVER span are rendered for display
  • OpenAI-format messages: Inputs and outputs of the MLflow Trace that follow the OpenAI chat format are rendered as a chat conversation
  • Dictionaries: Inputs and outputs of the MLflow Trace that are dicts are rendered as pretty-printed JSON

Otherwise, the content of the input and output from the root span of each trace is used as the primary content for review.
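
As a rough illustration of a trace shape that renders well, the sketch below uses a RETRIEVER span that returns MLflow Document objects and a root span whose input and output follow the OpenAI chat format; the function names and contents are placeholders.

    import mlflow
    from mlflow.entities import Document, SpanType

    @mlflow.trace(span_type=SpanType.RETRIEVER)
    def retrieve_docs(query: str) -> list[Document]:
        # Documents returned from a RETRIEVER span show up as retrieved documents
        return [
            Document(
                page_content="MLflow Tracing captures each step of your app.",
                metadata={"doc_uri": "docs/tracing.md"},
            )
        ]

    @mlflow.trace(span_type=SpanType.AGENT)
    def answer(messages: list[dict]) -> dict:
        # Root-span inputs/outputs in OpenAI chat format render as a conversation
        docs = retrieve_docs(messages[-1]["content"])
        return {
            "choices": [
                {"message": {"role": "assistant", "content": docs[0].page_content}}
            ]
        }

    answer([{"role": "user", "content": "What does MLflow Tracing capture?"}])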

Accessing feedback data

After experts provide feedback, the labels are stored in MLflow Traces in your Experiment. Use the Traces tab or Labeling Sessions tab to view the data.
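
To read the labels programmatically instead, a minimal sketch, assuming `session` is the labeling session object created earlier and that labels are attached to its traces as assessments:

    import mlflow

    # Fetch the traces linked to the labeling session's MLflow run
    labeled = mlflow.search_traces(run_id=session.mlflow_run_id, return_type="list")

    for trace in labeled:
        for assessment in trace.info.assessments:
            # Each label appears as an assessment (feedback or expectation) on the trace
            print(trace.info.trace_id, assessment.name)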

Next Steps