Human feedback

Human feedback is essential for building high-quality GenAI applications that meet user expectations. MLflow provides tools and a data model to collect, manage, and utilize feedback from developers, end-users, and domain experts.

Data model overview

MLflow stores human feedback as Assessments, attached to individual MLflow Traces. This links feedback directly to a specific user query and your GenAI app's outputs and logic.

There are two assessment types:

  1. Feedback: Evaluates your app's actual outputs or intermediate steps, answering questions such as "Was the agent's response good?". Feedback captures judgments about what the app produced, such as ratings or comments, and provides qualitative insight into quality.
  2. Expectation: Defines the desired or correct outcome (ground truth) that your app should have produced, such as the ideal response to a user's query. For a given input, the Expectation is always the same. Expectations define what the app should generate and are useful for building evaluation datasets.

Assessments can be attached to the entire Trace or a specific span within the Trace.
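The sketch below shows how both assessment types can be attached to a trace programmatically, assuming the `mlflow.log_feedback` and `mlflow.log_expectation` APIs available in MLflow 3.x. The trace ID, names, and values are hypothetical placeholders; the optional `span_id` argument targets a specific span instead of the whole trace.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Hypothetical placeholder: ID of an existing trace logged by your app.
trace_id = "tr-1234567890abcdef"

# Feedback: evaluates what the app actually produced.
mlflow.log_feedback(
    trace_id=trace_id,
    name="helpfulness",
    value=True,
    rationale="The response answered the user's question directly.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="developer@example.com",
    ),
)

# Expectation: records the ground-truth output the app should have produced.
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="Refunds are available within 30 days of purchase.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="expert@example.com",
    ),
)

# Pass span_id=... to either call to attach the assessment to a specific
# span within the trace instead of the whole trace.
```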

For more detail about the data model, see Tracing Data Model.

How to collect feedback

MLflow helps you collect feedback from three main sources, each tailored to a different stage of your GenAI app's lifecycle. While the feedback comes from different personas, the underlying data model is the same for all of them.

Developer feedback

During development, you can directly annotate traces. This is useful for tracking quality notes as you build and for marking specific examples for future reference or regression testing. To learn how to annotate feedback during development, see Labeling during development.
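As a rough sketch of what this can look like during local development, the example below assumes your app is instrumented with MLflow Tracing (here via the `@mlflow.trace` decorator on a hypothetical function) and that `mlflow.get_last_active_trace_id()` is available in your MLflow version; the note text and identities are placeholders.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType


@mlflow.trace  # hypothetical stand-in for your real app logic
def my_agent(question: str) -> str:
    return "You can reset your password from the account settings page."


response = my_agent("How do I reset my password?")

# Get the ID of the trace produced by the call above.
trace_id = mlflow.get_last_active_trace_id()

# Attach a quick quality note to that trace for later reference.
mlflow.log_feedback(
    trace_id=trace_id,
    name="dev_note",
    value="Good answer, but it should link to the self-service portal.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="developer@example.com",
    ),
)
```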

Domain expert feedback and expectations

Engage subject matter experts to provide structured feedback on your app's outputs and expectations about your app's inputs. Their detailed evaluations help define what high-quality, correct responses look like for your specific use case and are invaluable for aligning LLM judges with nuanced business requirements. To learn how to collect domain expert feedback, see Collect domain expert feedback.
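Expert labels are typically collected through a labeling UI, but if you gather them elsewhere (for example, an exported review spreadsheet), they can be logged programmatically. This is a minimal sketch assuming the MLflow 3.x assessment APIs; the trace IDs, label contents, and expert identity are hypothetical.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Hypothetical labels exported from an expert review session: each entry
# pairs a trace with the expert's ground-truth answer and a verdict.
expert_labels = [
    {
        "trace_id": "tr-aaa111",
        "expected_response": "Refunds are available within 30 days of purchase.",
        "is_correct": False,
        "note": "The app cited a 14-day window; policy is 30 days.",
    },
]

source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN,
    source_id="expert@example.com",  # hypothetical expert identity
)

for label in expert_labels:
    # Ground truth the app should have produced for this input.
    mlflow.log_expectation(
        trace_id=label["trace_id"],
        name="expected_response",
        value=label["expected_response"],
        source=source,
    )
    # The expert's verdict on what the app actually produced.
    mlflow.log_feedback(
        trace_id=label["trace_id"],
        name="correctness",
        value=label["is_correct"],
        rationale=label["note"],
        source=source,
    )
```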

End-user feedback

In production, capture feedback from users interacting with your live application. This provides crucial insights into real-world performance, helping you identify problematic queries that need fixing and highlighting successful interactions to preserve during future updates. To learn how to collect end-user feedback, see Collecting End User Feedback.
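One common pattern is to return the trace ID to the client when a response is generated, then record the user's rating against that trace. The sketch below assumes a FastAPI service (not part of MLflow) with a hypothetical `/feedback` endpoint; all names and fields are illustrative.

```python
from typing import Optional

import mlflow
from fastapi import FastAPI
from mlflow.entities import AssessmentSource, AssessmentSourceType
from pydantic import BaseModel

app = FastAPI()


class FeedbackPayload(BaseModel):
    # The trace ID is returned to the client alongside the app's response
    # so the UI can reference it when the user rates the answer.
    trace_id: str
    thumbs_up: bool
    comment: Optional[str] = None


@app.post("/feedback")
def submit_feedback(payload: FeedbackPayload):
    # Record the end user's rating as Feedback on the serving trace.
    mlflow.log_feedback(
        trace_id=payload.trace_id,
        name="user_satisfaction",
        value=payload.thumbs_up,
        rationale=payload.comment,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="end_user",  # or a user ID from your auth system
        ),
    )
    return {"status": "recorded"}
```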

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides