Built-in LLM judges

Built-in LLM judges are predefined scorers that use Databricks-hosted LLMs to evaluate common quality dimensions of your GenAI application, such as relevance, safety, groundedness, and correctness. Use them when you want to start evaluating quality quickly. When you need more control over how quality is judged, use custom LLM judges or code-based scorers written in Python.

For the complete list and detailed documentation, see the MLflow predefined scorers documentation.

Available judges

| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context? Is the agent hallucinating? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| ToolCallCorrectness | inputs, outputs, expectations | Yes | Are the tool calls and arguments correct for the user query? |
| ToolCallEfficiency | inputs, outputs | No | Are the tool calls efficient without redundancy? |
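
As a rough illustration of how several of these judges might be combined, the sketch below builds a one-row evaluation dataset and passes built-in judges to `mlflow.genai.evaluate`. It assumes an MLflow 3 installation with the `mlflow.genai` module and an environment that can reach the Databricks-hosted judge models. The sample question, the sample answer, the `english_only` guideline, and the `run_builtin_judges` helper name are hypothetical and chosen only for this example.

```python
# One evaluation record. "expectations" is only needed by judges that
# require ground truth, such as Correctness.
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open source platform for managing the ML lifecycle.",
        "expectations": {
            "expected_response": "MLflow is an open source platform for the machine learning lifecycle."
        },
    },
]


def run_builtin_judges(data):
    """Score `data` with a few built-in judges (hypothetical helper)."""
    # Imported lazily so the dataset above can be inspected without MLflow installed.
    import mlflow
    from mlflow.genai.scorers import Correctness, Guidelines, RelevanceToQuery, Safety

    scorers = [
        RelevanceToQuery(),  # uses inputs and outputs
        Safety(),            # uses inputs and outputs
        Correctness(),       # also needs expectations (ground truth)
        # Illustrative guideline; replace with criteria that fit your application.
        Guidelines(name="english_only", guidelines="The response must be in English."),
    ]
    return mlflow.genai.evaluate(data=data, scorers=scorers)
```

Calling `run_builtin_judges(eval_data)` in a configured workspace should run each judge against the record. Judges that require ground truth can be left out of the list when the dataset has no `expectations` field.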

Multi-turn judges

For conversational AI systems, MLflow provides judges that evaluate entire conversations rather than individual turns. These judges analyze the complete conversation history to assess quality patterns that emerge over multiple interactions.

Use multi-turn judges both for evaluation during development and for monitoring in production.

| Judge | Arguments | Requires ground truth | What it evaluates |
| --- | --- | --- | --- |
| ConversationCompleteness | session | No | Did the agent address all user questions throughout the conversation? |
| UserFrustration | session | No | Did the user become frustrated? Was the frustration resolved? |
| KnowledgeRetention | session | No | Does the agent correctly retain information from earlier in the conversation? |
| ConversationalGuidelines | session, guidelines | No | Do the assistant's responses comply with provided guidelines throughout the conversation? |
| ConversationalRoleAdherence | session | No | Does the assistant maintain its assigned role throughout the conversation? |
| ConversationalSafety | session | No | Are the assistant's responses safe and free of harmful content? |
| ConversationalToolCallEfficiency | session | No | Was tool usage across the conversation efficient and appropriate? |
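
The Arguments column in the judge listings can also serve as a quick pre-flight check before an evaluation run. The sketch below is an illustrative helper, not part of MLflow: `REQUIRED_ARGS` and `runnable_judges` are hypothetical names, and the mapping covers only a subset of the judges listed on this page.

```python
# Required arguments per judge, copied from the Arguments column on this page.
# This mapping and the helper below are illustrative; they are not an MLflow API.
REQUIRED_ARGS = {
    "RelevanceToQuery": {"inputs", "outputs"},
    "Safety": {"inputs", "outputs"},
    "Correctness": {"inputs", "outputs", "expectations"},
    "RetrievalSufficiency": {"inputs", "outputs", "expectations"},
    "ConversationCompleteness": {"session"},
    "ConversationalGuidelines": {"session", "guidelines"},
}


def runnable_judges(available):
    """Return, sorted by name, the judges whose required arguments are all available."""
    have = set(available)
    return sorted(name for name, needed in REQUIRED_ARGS.items() if needed <= have)


print(runnable_judges(["inputs", "outputs"]))  # ['RelevanceToQuery', 'Safety']
print(runnable_judges(["session"]))            # ['ConversationCompleteness']
```

A check like this makes it clear up front which judges need ground truth or session data, so missing fields surface before any LLM calls are made.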

Next steps