Code-based scorers are Python functions that you create. Use them when built-in LLM judges and custom LLM judges don't fit your evaluation needs. For example, code-based scorers enable you to:
- Define a custom heuristic or programmatic evaluation metric.
- Customize how trace data is mapped to a Databricks built-in LLM judge.
- Use your own LLM (instead of a Databricks-hosted LLM judge) for evaluation.
- Handle other use cases that need more flexibility and control than custom LLM judges provide.
You can use the same code-based scorer for evaluation in development and monitoring in production.
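As a minimal sketch of the first use case above, a heuristic metric can be written as an ordinary function wrapped with `@scorer`. The scorer name `is_concise` and the 50-word threshold are hypothetical, chosen only for illustration. The `try`/`except` fallback is a convenience so the snippet also runs where MLflow is not installed; it is not part of MLflow.

```python
try:
    from mlflow.genai.scorers import scorer
except ImportError:
    # Illustration-only fallback: a pass-through decorator so this sketch
    # still runs in an environment without MLflow.
    def scorer(func=None, **_kwargs):
        return func if func is not None else (lambda f: f)


MAX_WORDS = 50  # hypothetical threshold, not a value from the documentation


@scorer
def is_concise(outputs: str) -> bool:
    """Pass when the app's response contains at most MAX_WORDS words."""
    return len(str(outputs).split()) <= MAX_WORDS


# Call the scorer directly to check its behavior before using it in an evaluation.
short_result = is_concise(outputs="Paris is the capital of France.")
long_result = is_concise(outputs="word " * 51)
```

A scorer defined this way would typically be passed in the `scorers` list of `mlflow.genai.evaluate()`, alongside any built-in judges.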
Choose a definition style
MLflow supports two ways to define a code-based scorer:
| Approach | Use when | Production monitoring |
|---|---|---|
| `@scorer` decorator | Most cases. Recommended starting point. | Supported (when defined and registered from a Databricks notebook). |
| `Scorer` class | You need stateful scorers, complex initialization, or Pydantic fields. | Not supported. |
:::note Compatibility with production monitoring
Production monitoring supports built-in LLM judges and @scorer-decorated functions. Class-based Scorer subclasses are not supported for production monitoring. If you need stateful scorers in production, use the @scorer decorator and manage state inside the function body.
@scorer-decorated functions used in production monitoring must be defined and registered from a Databricks notebook. The monitoring service serializes the function code for remote execution, and this serialization requires the notebook environment. For details, see Use custom scorer functions.
:::
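One way to read "manage state inside the function body" is to keep imports, configuration, and helper objects local to the decorated function, so that the function does not depend on notebook-level globals when its code is serialized. The sketch below works under that assumption; the scorer name `no_banned_terms` and the term list are hypothetical. As in any standalone snippet, the `try`/`except` fallback exists only so the example runs without MLflow installed.

```python
try:
    from mlflow.genai.scorers import scorer
except ImportError:
    # Illustration-only fallback: a pass-through decorator so this sketch
    # still runs in an environment without MLflow.
    def scorer(func=None, **_kwargs):
        return func if func is not None else (lambda f: f)


@scorer
def no_banned_terms(outputs: str) -> bool:
    """Pass when the response contains none of the banned terms."""
    # Assumption: keeping the import and configuration inside the body makes
    # the function self-contained when it is serialized for remote execution.
    import re

    banned_terms = ("password", "ssn")  # hypothetical configuration
    pattern = re.compile(
        "|".join(re.escape(term) for term in banned_terms), re.IGNORECASE
    )
    return pattern.search(str(outputs)) is None


clean_result = no_banned_terms(outputs="Your order has shipped.")
flagged_result = no_banned_terms(outputs="My PASSWORD is hunter2.")
```

The pattern is rebuilt on every call, which trades a small amount of work for a function that carries everything it needs with it.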
Next steps
- Develop code-based scorers: Step through the development workflow for code-based scorers.
- Code-based scorer examples: Worked examples covering common code-based scorer patterns.
- Code-based scorer reference: Reference for `@scorer` and `Scorer`, including signatures, inputs, outputs, metric naming, error handling, and accessing secrets.
- Evaluate GenAI during development: Understand how `mlflow.genai.evaluate()` uses your scorers.
- Monitor GenAI apps in production: Deploy scorers for continuous monitoring.