Built-in LLM judges

Built-in LLM judges are predefined scorers that use Databricks-hosted LLMs to evaluate common quality dimensions of your GenAI application, such as relevance, safety, groundedness, and correctness. Use them when you want to start evaluating quality quickly. When you need more control over your judges, use custom LLM judges or code-based (Python) scorers.

For the complete list and detailed documentation, see the MLflow predefined scorers documentation.

Available judges

Judge	Arguments	Requires ground truth	What it evaluates
`RelevanceToQuery`	`inputs`, `outputs`	No	Is the response directly relevant to the user's request?
`RetrievalRelevance`	`inputs`, `outputs`	No	Is the retrieved context directly relevant to the user's request?
`Safety`	`inputs`, `outputs`	No	Is the content free from harmful, offensive, or toxic material?
`RetrievalGroundedness`	`inputs`, `outputs`	No	Is the response grounded in the information provided in the context? Is the agent hallucinating?
`Correctness`	`inputs`, `outputs`, `expectations`	Yes	Is the response correct as compared to the provided ground truth?
`RetrievalSufficiency`	`inputs`, `outputs`, `expectations`	Yes	Does the context provide all necessary information to generate a response that includes the ground truth facts?
`Guidelines`	`inputs`, `outputs`	No	Does the response meet specified natural language criteria?
`ExpectationsGuidelines`	`inputs`, `outputs`, `expectations`	No (but needs guidelines in expectations)	Does the response meet per-example natural language criteria?
`ToolCallCorrectness`	`inputs`, `outputs`, `expectations`	Yes	Are the tool calls and arguments correct for the user query?
`ToolCallEfficiency`	`inputs`, `outputs`	No	Are the tool calls efficient without redundancy?
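
As a rough illustration of how the arguments in the table map onto evaluation data, the sketch below builds two records and passes a few built-in judges to `mlflow.genai.evaluate`. It assumes MLflow 3 with the `mlflow.genai.scorers` module installed and a workspace where the Databricks-hosted judge models are reachable. The sample questions, the `expected_response` key, the guideline text, and the helper names `build_eval_data` and `run_builtin_judges` are illustrative and do not come from this page.

```python
def build_eval_data():
    """Return sample records; the field names mirror the Arguments column."""
    return [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an open source platform for managing the ML lifecycle.",
            # Ground truth, needed only by judges such as Correctness.
            "expectations": {
                "expected_response": "MLflow is an open source MLOps platform."
            },
        },
        {
            "inputs": {"question": "How do I log a metric?"},
            "outputs": "Call mlflow.log_metric with a name and a numeric value.",
            "expectations": {
                "expected_response": "Use mlflow.log_metric(key, value)."
            },
        },
    ]


def run_builtin_judges(data):
    """Score the records with a few built-in judges (this calls a judge LLM)."""
    # Imported here so that building the data does not require MLflow.
    import mlflow
    from mlflow.genai.scorers import (
        Correctness,
        Guidelines,
        RelevanceToQuery,
        Safety,
    )

    return mlflow.genai.evaluate(
        data=data,
        scorers=[
            RelevanceToQuery(),  # inputs, outputs
            Safety(),            # inputs, outputs
            Correctness(),       # inputs, outputs, expectations
            # Hypothetical guideline, for illustration only.
            Guidelines(name="concise", guidelines="The response must be concise."),
        ],
    )
```

Calling `run_builtin_judges(build_eval_data())` in a configured workspace should return an evaluation result whose per-judge scores can then be inspected. Records without an `expectations` field can still be scored by the judges marked "No" in the ground-truth column.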

Multi-turn judges

For conversational AI systems, MLflow provides judges that evaluate entire conversations rather than individual turns. These judges analyze the complete conversation history to assess quality patterns that emerge over multiple interactions.

Use multi-turn judges both for evaluation during development and for monitoring in production.

Judge	Arguments	Requires ground truth	What it evaluates
`ConversationCompleteness`	`session`	No	Did the agent address all user questions throughout the conversation?
`UserFrustration`	`session`	No	Did the user become frustrated? Was the frustration resolved?
`KnowledgeRetention`	`session`	No	Does the agent correctly retain information from earlier in the conversation?
`ConversationalGuidelines`	`session`, `guidelines`	No	Do the assistant's responses comply with provided guidelines throughout the conversation?
`ConversationalRoleAdherence`	`session`	No	Does the assistant maintain its assigned role throughout the conversation?
`ConversationalSafety`	`session`	No	Are the assistant's responses safe and free of harmful content?
`ConversationalToolCallEfficiency`	`session`	No	Was tool usage across the conversation efficient and appropriate?
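
For a single conversation, the multi-turn judges can also be sketched as direct calls. The example below assumes a recent MLflow 3 release in which these classes are importable from `mlflow.genai.scorers`, that each judge instance can be called with the `session` argument listed in the table, and that `session_traces` is a list of MLflow traces from one conversation that you have already retrieved. `score_conversation` is a hypothetical helper name.

```python
def score_conversation(session_traces):
    """Run two multi-turn judges on the traces of one conversation.

    session_traces: list of MLflow traces that belong to the same
    session (assumed to be retrieved by the caller beforehand).
    """
    # Imported lazily so that defining the helper does not require MLflow.
    from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

    judges = [ConversationCompleteness(), UserFrustration()]
    # Each judge receives the whole conversation, not a single turn.
    return {type(judge).__name__: judge(session=session_traces) for judge in judges}
```

The returned dictionary maps each judge's class name to the feedback it produced for the conversation.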

Next steps

Choose the LLM that powers a judge
Build a custom LLM judge when built-in judges don't fit your use case
Align judges with human feedback to improve accuracy on your domain

Last updated on 2026-05-05