Module 8: Your first evals!

What is a Golden Q&A?

A Golden Q&A is a list of common questions and accurate answers, created by experts on your team. These are used to test your AI Agent's performance.

Bulk Evaluation Process (Overview)

  • The Golden Q&A sheet is used as the test set in the bulk evaluator.

  • For each question, the AI Agent generates an answer.

  • The evaluator compares the AI Agent answer to the expert (golden) answer.

  • Scores are based on technical accuracy, citation correctness, and answer quality.
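The steps above can be sketched as a small loop. This is only an illustration and not Gooey.AI's implementation: `GoldenItem`, `bulk_evaluate`, and the `agent` callable are hypothetical names, and the token-overlap F1 scorer is a simple stand-in for the real evaluator, which scores technical accuracy, citation correctness, and answer quality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenItem:
    """One row of a Golden Q&A sheet (hypothetical structure)."""
    question: str
    golden_answer: str


def token_f1(candidate: str, reference: str) -> float:
    """Stand-in scorer: F1 over the unique lowercase tokens of both answers."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bulk_evaluate(
    items: List[GoldenItem],
    agent: Callable[[str], str],
    scorer: Callable[[str, str], float] = token_f1,
) -> List[Dict[str, object]]:
    """Ask the agent every golden question and score its answer."""
    rows = []
    for item in items:
        answer = agent(item.question)  # the AI Agent generates an answer
        rows.append(
            {
                "question": item.question,
                "golden_answer": item.golden_answer,
                "agent_answer": answer,
                # compare the agent answer to the expert (golden) answer
                "score": scorer(answer, item.golden_answer),
            }
        )
    return rows
```

Here `agent` can be any function that takes a question and returns an answer string, so a stub works for a dry run before you connect a real workflow.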

Why is bulk run and evaluation important?

Bulk runs and evaluations help you with:

  • Choosing the right:

    • LLM (large language model)

    • TTS (text-to-speech) model

    • STT (speech-to-text) model

    • Translation model

  • Improving your AI Agent's overall responses

  • Assessing time versus cost for your chosen pipeline

  • Checking for regressions regularly
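As a rough illustration of a regression check, the sketch below compares per-question scores from two evaluation runs and flags questions whose score dropped. The function name `find_regressions`, the score dictionaries, and the `tolerance` value are all assumptions made for this example, not part of Gooey.AI.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return questions whose score dropped by more than `tolerance`.

    Both arguments map a question to its evaluation score. Questions
    missing from the candidate run are skipped rather than flagged.
    """
    regressions = []
    for question, old_score in baseline.items():
        new_score = candidate.get(question)
        if new_score is None:
            continue
        if old_score - new_score > tolerance:
            regressions.append(question)
    return regressions
```

Running a check like this after each settings change makes it easier to see whether a tweak that improves one answer has quietly made another one worse.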

Why do you need a bulk runner and evaluations?

When building your Gooey.AI workflows, you will often need to tweak the settings to ensure that responses are consistent, grounded, and verifiable.

There are several components to test:

  • Testing prompts

  • Ensuring that synthetic data retrieval works

  • Checking the suitability of the language model and its advanced settings

  • Measuring the latency of generated answers

  • Evaluating how well the final AI Agent reproduces the Golden Answers

  • Evaluating the price per run

  • Running regression tests

How can you do this at scale?

This is where Gooey.AI’s Bulk and Evaluation features shine!

Features of Bulk Runner and Evaluation

  • Run several models in one click

  • Run several iterations of your workflows at scale

  • Choose any of the API Response Outputs to populate your test

  • Get output in CSV for further data analysis

  • Built-in evaluation tool for quick analysis

  • Use CSV or Google Sheets as input
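To show what CSV input and output might look like on your side, here is a minimal sketch using Python's standard `csv` module. The column names `question`, `golden_answer`, and `agent_answer`, and the helper names `load_golden_qa` and `write_results`, are assumptions for this example; use whatever columns your own sheet and workflow outputs define.

```python
import csv
from typing import Dict, List, TextIO


def load_golden_qa(csv_file: TextIO) -> List[Dict[str, str]]:
    """Read Golden Q&A rows from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return [
        {
            "question": row["question"].strip(),
            "golden_answer": row["golden_answer"].strip(),
        }
        for row in reader
    ]


def write_results(rows: List[Dict[str, str]], csv_file: TextIO) -> None:
    """Write evaluated rows to an open CSV file for further analysis."""
    fieldnames = ["question", "golden_answer", "agent_answer"]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

Both helpers take an already opened file object, so they work the same way with a file on disk (opened with `newline=""`, as the `csv` documentation recommends) or with an in-memory buffer during testing.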
