Code-based scorers

Code-based scorers are Python functions that you create. Use them when built-in LLM judges and custom LLM judges don't fit your evaluation needs. For example, code-based scorers enable you to:

Define a custom heuristic or programmatic evaluation metric.
Customize how trace data is mapped to a Databricks built-in LLM judge.
Use your own LLM (instead of a Databricks-hosted LLM judge) for evaluation.
Handle other use cases that need more flexibility and control than custom LLM judges provide.

You can use the same code-based scorer for evaluation in development and monitoring in production.
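As an illustration of the first use case in the list above, the following is a minimal sketch of a heuristic scorer. The function name `response_within_limit` and the 50-word limit are hypothetical choices made for this example. The sketch assumes MLflow 3, where `scorer` is importable from `mlflow.genai.scorers`; the import is guarded so the heuristic itself can be tested without MLflow installed.

```python
def response_within_limit(outputs) -> bool:
    """Pass when the response is non-empty and at most 50 words (hypothetical limit)."""
    text = str(outputs).strip()
    return 0 < len(text.split()) <= 50


# Wrap the plain function as an MLflow scorer when MLflow is available.
# Assumption: MLflow 3 or later, which exposes `scorer` in mlflow.genai.scorers.
try:
    from mlflow.genai.scorers import scorer

    response_within_limit_scorer = scorer(response_within_limit)
except ImportError:
    response_within_limit_scorer = None  # MLflow not installed; heuristic still usable


# Quick local check of the heuristic before using it in an evaluation run.
print(response_within_limit("Paris is the capital of France."))
```

Keeping the heuristic as a plain function makes it easy to unit test on its own. In a notebook you would more commonly apply `@scorer` directly above the function definition.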

Choose a definition style

MLflow supports two ways to define a code-based scorer:

| Approach | Use when | Production monitoring |
| --- | --- | --- |
| `@scorer` decorator | Most cases. Recommended starting point. | Supported (when defined and registered from a Databricks notebook). |
| `Scorer` class | You need stateful scorers, complex initialization, or Pydantic fields. | Not supported. |

:::note Compatibility with production monitoring

Production monitoring supports built-in LLM judges and @scorer-decorated functions. Class-based Scorer subclasses are not supported for production monitoring. If you need stateful scorers in production, use the @scorer decorator and manage state inside the function body.

@scorer-decorated functions used in production monitoring must be defined and registered from a Databricks notebook. The monitoring service serializes the function code for remote execution, and this serialization requires the notebook environment. For details, see Use custom scorer functions.

:::
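One way to read the guidance in the note above is to keep everything a scorer needs, including imports and configuration, inside the function body so that the serialized function code is self-contained. The sketch below follows that reading; it is an assumption about what "manage state inside the function body" means in practice, not a confirmed requirement. The function name `banned_phrase_count` and the phrase list are hypothetical.

```python
def banned_phrase_count(outputs) -> int:
    """Count occurrences of banned phrases in the response (case-insensitive)."""
    # Import and configuration are placed inside the body so the function
    # does not depend on notebook-level globals (assumption, see lead-in).
    import re

    banned_phrases = ["as an ai", "i cannot help"]  # hypothetical configuration
    pattern = re.compile(
        "|".join(re.escape(phrase) for phrase in banned_phrases),
        re.IGNORECASE,
    )
    return len(pattern.findall(str(outputs)))


# Wrap as an MLflow scorer when MLflow is available (assumes MLflow 3 or later).
try:
    from mlflow.genai.scorers import scorer

    banned_phrase_scorer = scorer(banned_phrase_count)
except ImportError:
    banned_phrase_scorer = None  # MLflow not installed; plain function still works


print(banned_phrase_count("As an AI, I cannot help with that."))
```

Rebuilding the pattern on every call costs a little time, but it avoids relying on objects that exist only in the notebook session where the scorer was defined.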

Next steps

Develop code-based scorers — Step through the development workflow for code-based scorers.
Code-based scorer examples — Worked examples covering common code-based scorer patterns.
Code-based scorer reference — Reference for @scorer and Scorer, including signatures, inputs, outputs, metric naming, error handling, and accessing secrets.
Evaluate GenAI during development — Understand how mlflow.genai.evaluate() uses your scorers.
Monitor GenAI apps in production — Deploy scorers for continuous monitoring.

Last updated on 2026-05-05
