Evaluation of generative AI applications

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In the rapidly evolving landscape of artificial intelligence, the integration of Generative AI Operations (GenAIOps) is transforming how organizations develop and deploy AI applications. As businesses increasingly rely on AI to enhance decision-making, improve customer experiences, and drive innovation, the importance of a robust evaluation framework can't be overstated. Evaluation is an essential component of the generative AI lifecycle to build confidence and trust in AI-centric applications. If not designed carefully, these applications can produce outputs that are fabricated and ungrounded in context, irrelevant, or incoherent, resulting in poor customer experiences, or worse, perpetuate societal stereotypes, promote misinformation, expose organizations to malicious attacks, or cause a wide range of other negative impacts.

Evaluators are helpful tools for assessing the frequency and severity of content risks or undesirable behavior in AI responses. Performing iterative, systematic evaluations with the right evaluators helps teams measure and address potential response quality, safety, or security concerns throughout the AI development lifecycle, from initial model selection through post-production monitoring.

Diagram of enterprise GenAIOps lifecycle, showing model selection, building an AI application, and operationalizing.

By understanding and implementing effective evaluation strategies at each stage, organizations can ensure their AI solutions not only meet initial expectations but also adapt and thrive in real-world environments. Let's dive into how evaluation fits into the three critical stages of the AI lifecycle:

Base model selection

The first stage of the AI lifecycle involves selecting an appropriate base model. Generative AI models vary widely in terms of capabilities, strengths, and limitations, so it's essential to identify which model best suits your specific use case. During base model evaluation, you "shop around" to compare different models by testing their outputs against a set of criteria relevant to your application.

Key considerations at this stage might include:

  • Accuracy/quality: How well does the model generate relevant and coherent responses?
  • Performance on specific tasks: Can the model handle the type of prompts and content required for your use case? What are its latency and cost?
  • Bias and ethical considerations: Does the model produce any outputs that might perpetuate or promote harmful stereotypes?
  • Risk and safety: Are there any risks of the model generating unsafe or malicious content?

You can explore Azure AI Foundry benchmarks to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models with the Azure AI Evaluation SDK; for a demonstration, see the Evaluate model endpoints sample.
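For example, the following is a minimal sketch of spot-checking two candidate models with the SDK's built-in relevance evaluator. It assumes the azure-ai-evaluation package is installed; the judge deployment name, environment variables, and sample responses are placeholders, and a real comparison would score each candidate over a full test dataset rather than a single prompt.

```python
# Minimal sketch: spot-checking two candidate base models with a single
# AI-assisted quality evaluator. Deployment names, environment variables,
# and sample responses are placeholders.
import os
from azure.ai.evaluation import RelevanceEvaluator

# Judge model configuration used by the evaluator (placeholder values).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment
}

relevance = RelevanceEvaluator(model_config)

query = "What is the capital of France?"
candidate_responses = {
    "candidate-model-a": "The capital of France is Paris.",
    "candidate-model-b": "France is a country in Western Europe.",
}

# Score each candidate's response to the same prompt and compare.
for model_name, response in candidate_responses.items():
    score = relevance(query=query, response=response)
    print(model_name, score)
```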

Pre-production evaluation

After selecting a base model, the next step is to develop an AI application—such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. Following development, pre-production evaluation begins. Before deploying the application in a production environment, rigorous testing is essential to ensure the model is truly ready for real-world use.

Diagram of pre-production evaluation for models and applications with the six steps.

Pre-production evaluation involves:

  • Testing with evaluation datasets: These datasets simulate realistic user interactions to ensure the AI application performs as expected.
  • Identifying edge cases: Finding scenarios where the AI application’s response quality might degrade or produce undesirable outputs.
  • Assessing robustness: Ensuring that the model can handle a range of input variations without significant drops in quality or safety.
  • Measuring key metrics: Metrics such as response groundedness, relevance, and safety are evaluated to confirm readiness for production.

The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that doesn't meet the desired performance or safety standards.
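As a concrete illustration, the sketch below runs a batch evaluation over a JSONL test dataset with the Azure AI Evaluation SDK, combining the quality metrics above with a composite content safety check. The dataset path, judge deployment, and Azure AI Foundry project values are placeholders, and the exact evaluator constructor arguments may vary by SDK version.

```python
# Minimal sketch: pre-production batch evaluation over a JSONL test dataset.
# Dataset path, project details, and deployment names are placeholders.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    ContentSafetyEvaluator,
)

# Judge model configuration used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment
}

# Placeholder Azure AI Foundry project reference used by the safety evaluator.
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT_NAME"],
}

result = evaluate(
    # Each line of the JSONL is assumed to contain "query", "context", and "response".
    data="preprod_test_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "content_safety": ContentSafetyEvaluator(
            credential=DefaultAzureCredential(),
            azure_ai_project=azure_ai_project,
        ),
    },
    evaluation_name="preprod-readiness-check",
    output_path="./preprod_eval_results.json",
)

print(result["metrics"])  # aggregate scores across the dataset
```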

Alternatively, you can use Azure AI Foundry’s evaluation widget to test your generative AI applications.

Once satisfactory results are achieved, the AI application can be deployed to production.

Post-production monitoring

After deployment, the AI application enters the post-production evaluation phase, also known as online evaluation or monitoring. At this stage, the model is embedded within a real-world product and responds to actual user queries. Monitoring ensures that the model continues to behave as expected and adapts to any changes in user behavior or content. It typically includes:

  • Ongoing performance tracking: Regularly measuring the AI application’s responses using key metrics to ensure consistent output quality.
  • Incident response: Quickly responding to any harmful, unfair, or inappropriate outputs that might arise during real-world use.

By continuously monitoring the AI application’s behavior in production, you can maintain high-quality user experiences and swiftly address any issues that surface.
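One lightweight way to implement ongoing performance tracking is to periodically score a sample of logged production interactions with the same evaluators used before release. The sketch below assumes a hypothetical fetch_recent_interactions() helper supplied by your own telemetry pipeline, a placeholder judge deployment, and an illustrative alert threshold; the result key names may vary by SDK version.

```python
# Minimal sketch: scoring a sample of logged production traffic on a schedule.
# fetch_recent_interactions() is a hypothetical helper that returns logged
# {"query": ..., "context": ..., "response": ...} records from your telemetry store.
import os
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment
}

groundedness = GroundednessEvaluator(model_config)

def score_recent_traffic(records, alert_threshold=3.0):
    """Score sampled production responses and flag any below the threshold."""
    flagged = []
    for record in records:
        result = groundedness(
            query=record["query"],
            context=record["context"],
            response=record["response"],
        )
        # The evaluator returns a 1-5 score under the "groundedness" key.
        if result["groundedness"] < alert_threshold:
            flagged.append((record, result))
    return flagged

# Example wiring (records would come from your own logging pipeline):
# flagged = score_recent_traffic(fetch_recent_interactions(sample_size=100))
```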

Conclusion

GenAIOps is all about establishing a reliable and repeatable process for managing generative AI applications across their lifecycle. Evaluation plays a vital role at each stage, from base model selection, through pre-production testing, to ongoing post-production monitoring. By systematically measuring and addressing risks and refining AI systems at every step, teams can build generative AI solutions that are not only powerful but also trustworthy and safe for real-world use.

Cheat sheet:

| Purpose | Process | Parameters |
| --- | --- | --- |
| What are you evaluating for? | Identify or build relevant evaluators | - Quality and performance (Quality and performance sample notebook)<br>- Safety and Security (Safety and Security sample notebook)<br>- Custom (Custom sample notebook) |
| What data should you use? | Upload or generate relevant dataset | - Generic simulator for measuring Quality and Performance (Generic simulator sample notebook)<br>- Adversarial simulator for measuring Safety and Security (Adversarial simulator sample notebook) |
| What resources should conduct the evaluation? | Run evaluation | - Local run<br>- Remote cloud run |
| How did my model/app perform? | Analyze results | View aggregate scores, view details and score details, compare evaluation runs |
| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results didn't align to human feedback, adjust your evaluator.<br>- If evaluation results aligned to human feedback but didn't meet quality/safety thresholds, apply targeted mitigations. |
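
For the Custom row in the cheat sheet, a custom evaluator can be as simple as a callable that accepts your dataset's fields and returns a dictionary of scores; it can then be passed to evaluate() alongside the built-in evaluators. The class below is a hypothetical example, not part of the SDK, and the dataset path in the usage note is a placeholder.

```python
# Minimal sketch: a custom code-based evaluator, per the "Custom" row above.
# Any callable that accepts the dataset's fields and returns a dict of scores
# can be passed to evaluate() alongside the built-in evaluators.
class ResponseLengthEvaluator:
    """Scores whether a response stays within a target length budget."""

    def __init__(self, max_words: int = 150):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "within_budget": 1.0 if word_count <= self.max_words else 0.0,
        }

# Usage with the batch evaluation API (dataset path is a placeholder):
# from azure.ai.evaluation import evaluate
# result = evaluate(
#     data="preprod_test_dataset.jsonl",
#     evaluators={"length": ResponseLengthEvaluator(max_words=100)},
# )
```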