Module 8: Your first evals!
What is a Golden Q&A?
A Golden Q&A is a list of common questions and accurate answers, created by experts on your team. These are used to test your AI Agent's performance.
Bulk Evaluation Process (Overview)
The Golden Q&A sheet is used as the test set in the bulk evaluator.
For each question, the AI Agent generates an answer.
The evaluator compares the AI Agent answer to the expert (golden) answer.
Scores are based on technical accuracy, citation correctness, and answer quality.
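The steps above can be sketched as a small loop. This is a minimal illustration and not Gooey.AI's evaluator: `fake_agent`, the `question` and `golden_answer` field names, and the token-overlap F1 score are all assumptions made for the example. The real evaluator scores technical accuracy, citation correctness, and answer quality, which a simple word-overlap metric cannot capture.

```python
import re
from collections import Counter


def token_f1(candidate: str, golden: str) -> float:
    """Token-overlap F1 between two answers (a crude stand-in metric)."""
    cand = re.findall(r"\w+", candidate.lower())
    gold = re.findall(r"\w+", golden.lower())
    if not cand or not gold:
        return 0.0
    # Count tokens shared by both answers, respecting multiplicity.
    overlap = sum((Counter(cand) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def bulk_evaluate(golden_qa, agent):
    """Run every golden question through the agent and score the answer."""
    results = []
    for row in golden_qa:
        answer = agent(row["question"])
        results.append({
            "question": row["question"],
            "golden_answer": row["golden_answer"],
            "agent_answer": answer,
            "score": token_f1(answer, row["golden_answer"]),
        })
    return results


# Hypothetical test set; in practice this comes from your Golden Q&A sheet.
golden_qa = [
    {"question": "What is a Golden Q&A?",
     "golden_answer": "A list of common questions and accurate answers created by experts."},
    {"question": "Why run bulk evaluations?",
     "golden_answer": "To compare models, costs, and regressions."},
]


def fake_agent(question: str) -> str:
    """Hypothetical stand-in for the AI Agent; returns canned answers."""
    canned = {
        "What is a Golden Q&A?":
            "A list of common questions and accurate answers created by experts.",
    }
    return canned.get(question, "")


results = bulk_evaluate(golden_qa, fake_agent)
for r in results:
    print(f"{r['score']:.2f}  {r['question']}")
```

Keeping the agent as a plain callable makes it easy to swap in a real API call later without changing the evaluation loop.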


Why is bulk run and evaluation important?
Bulk runs and evaluations help you:
Choose the right LLM, text-to-speech (TTS), speech-to-text (STT), and translation models
Improve your AI Agent's overall responses
Assess time versus cost for each pipeline choice
Check for regressions regularly
Why do you need a bulk runner and evaluations?
When building your Gooey.AI workflows, you will often need to tweak the settings to ensure that responses show parity with the golden answers and are grounded and verifiable.
There are several components to test:
testing prompts
ensuring the synthetic data retrieval works
checking the suitability of the language model and its advanced settings
measuring the latency of generated answers
evaluating whether the final AI Agent produces the Golden Answers
evaluating the price per run
running regression tests
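Several of the checks listed above (latency, price per run, regressions) reduce to simple aggregation once each run is recorded. The sketch below assumes hypothetical record fields (`config`, `score`, `latency_s`, `price_usd`) and a hypothetical tolerance of 0.05; none of these names come from Gooey.AI's output format.

```python
from statistics import mean


def summarize(runs):
    """Average score, latency, and price for each pipeline configuration."""
    by_config = {}
    for run in runs:
        by_config.setdefault(run["config"], []).append(run)
    return {
        name: {
            "mean_score": mean([r["score"] for r in rows]),
            "mean_latency_s": mean([r["latency_s"] for r in rows]),
            "mean_price_usd": mean([r["price_usd"] for r in rows]),
        }
        for name, rows in by_config.items()
    }


def find_regressions(baseline, candidate, tolerance=0.05):
    """Questions whose score dropped by more than `tolerance` versus baseline."""
    return [
        question
        for question, base_score in baseline.items()
        if question in candidate and base_score - candidate[question] > tolerance
    ]


# Hypothetical run records for two pipeline configurations.
runs = [
    {"config": "model-a", "score": 1.0, "latency_s": 1.0, "price_usd": 0.02},
    {"config": "model-a", "score": 0.5, "latency_s": 2.0, "price_usd": 0.04},
    {"config": "model-b", "score": 0.5, "latency_s": 0.5, "price_usd": 0.01},
    {"config": "model-b", "score": 0.5, "latency_s": 0.5, "price_usd": 0.01},
]
summary = summarize(runs)

# Hypothetical per-question scores from a previous run and the current run.
baseline_scores = {"q1": 0.9, "q2": 0.8}
current_scores = {"q1": 0.9, "q2": 0.5}
regressed = find_regressions(baseline_scores, current_scores)
```

Comparing per-question scores, and not only the overall average, helps surface a regression on one question that an improvement elsewhere would otherwise hide.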
How can you do this at scale?
This is where Gooey.AI’s Bulk and Evaluation features shine!
Features of Bulk Runner and Evaluation
Run several models in one click
Run several iterations of your workflows at scale
Choose any of the API Response Outputs to populate your test
Get output in CSV for further data analysis
Built-in evaluation tool for quick analysis
Use CSV or Google Sheets as input
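As a rough illustration of the CSV round trip, the sketch below loads a small test set from CSV text and writes a results file with extra columns. The column names (`question`, `golden_answer`, `agent_answer`, `score`) are assumptions for this example; match them to the columns in your own sheet and in the output you download.

```python
import csv
import io

# Hypothetical input sheet exported as CSV.
GOLDEN_CSV = """question,golden_answer
What is a Golden Q&A?,A list of common questions and accurate answers created by experts.
Why run bulk evaluations?,"To compare models, costs, and regressions."
"""


def load_golden_qa(csv_text):
    """Parse CSV text into a list of dicts, one per question."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def results_to_csv(rows, fieldnames):
    """Serialize result rows back to CSV text for further analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


golden_rows = load_golden_qa(GOLDEN_CSV)
# Placeholder output columns; a real run would fill these in.
output_rows = [dict(row, agent_answer="", score="") for row in golden_rows]
output_csv = results_to_csv(
    output_rows, ["question", "golden_answer", "agent_answer", "score"]
)
print(output_csv)
```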