Code-based scorers

Code-based scorers are Python functions that you create. Use them when built-in LLM judges and custom LLM judges don't fit your evaluation needs. For example, code-based scorers enable you to:

  • Define a custom heuristic or programmatic evaluation metric.
  • Customize how trace data is mapped to a Databricks built-in LLM judge.
  • Use your own LLM (instead of a Databricks-hosted LLM judge) for evaluation.
  • Handle other use cases that need more flexibility and control than custom LLM judges provide.

You can use the same code-based scorer for both evaluation in development and monitoring in production, provided it is defined with the @scorer decorator.

Choose a definition style

MLflow supports two ways to define a code-based scorer:

| Approach | Use when | Production monitoring |
| --- | --- | --- |
| @scorer decorator | Most cases. Recommended starting point. | Supported (when defined and registered from a Databricks notebook). |
| Scorer class | You need stateful scorers, complex initialization, or Pydantic fields. | Not supported. |
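
As a minimal sketch of the decorator style, the example below defines a heuristic scorer that checks whether a response matches an expected answer after normalizing case and whitespace. The scorer name `exact_match` and the `expected_response` key are illustrative assumptions, not names required by MLflow. The `ImportError` fallback is also an addition for this sketch, so that it runs where MLflow is not installed.

```python
try:
    from mlflow.genai.scorers import scorer
except ImportError:
    # Fallback for this sketch only: a no-op decorator that returns
    # the function unchanged when MLflow is not installed.
    def scorer(func):
        return func


@scorer
def exact_match(outputs, expectations) -> bool:
    """Pass when the response equals the expected answer after normalization."""

    def normalize(text) -> str:
        # Lowercase and collapse runs of whitespace.
        return " ".join(str(text).lower().split())

    # "expected_response" is an assumed key in the expectations dict.
    return normalize(outputs) == normalize(expectations["expected_response"])


# Call the scorer directly to check its logic on a single example.
result = exact_match(
    outputs="  Paris ",
    expectations={"expected_response": "paris"},
)
```

A scorer defined this way would typically be passed in the `scorers` list of `mlflow.genai.evaluate`, alongside any built-in judges.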

:::note Compatibility with production monitoring

Production monitoring supports built-in LLM judges and @scorer-decorated functions. Class-based Scorer subclasses are not supported for production monitoring. If you need stateful scorers in production, use the @scorer decorator and manage state inside the function body.

@scorer-decorated functions used in production monitoring must be defined and registered from a Databricks notebook. The monitoring service serializes the function code for remote execution, and this serialization requires the notebook environment. For details, see Use custom scorer functions.

:::
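
Because the monitoring service serializes the function code, one reasonable reading of "manage state inside the function body" is to keep imports and configuration inside the function rather than at notebook level. The sketch below follows that assumption. The scorer name `within_word_limit` and the 50-word threshold are hypothetical, and the `ImportError` fallback exists only so the sketch runs without MLflow installed.

```python
try:
    from mlflow.genai.scorers import scorer
except ImportError:
    # Fallback for this sketch only: a no-op decorator.
    def scorer(func):
        return func


@scorer
def within_word_limit(outputs) -> bool:
    """Pass when the response has at most max_words words."""
    # Import and configure inside the body so the function does not
    # rely on names defined elsewhere in the notebook (an assumption
    # about what serialization for remote execution needs).
    import re

    max_words = 50  # hypothetical threshold
    return len(re.findall(r"\w+", str(outputs))) <= max_words
```

Keeping the function self-contained in this way means the same definition can be exercised locally by calling `within_word_limit(outputs="...")` before it is registered for monitoring.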

Next steps