Module 8: Your first evals!

A Golden Q&A is a list of common questions paired with accurate answers, written by experts on your team. It serves as the reference for testing your AI Agent's performance.

The Golden Q&A sheet is used as the test set in the bulk evaluator.
For each question, the AI Agent generates an answer.
The evaluator compares the AI Agent's answer to the expert (golden) answer.
Scores are based on technical accuracy, citation correctness, and answer quality.
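
The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not Gooey.AI's evaluator or API: the names `GoldenPair`, `bulk_evaluate`, `token_overlap_score`, and `fake_agent` are hypothetical, the agent is a canned stand-in, and the single token-overlap number is a deliberately crude placeholder for real scoring on technical accuracy, citation correctness, and answer quality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenPair:
    """One row of a Golden Q&A sheet (hypothetical structure)."""
    question: str
    golden_answer: str


def token_overlap_score(agent_answer: str, golden_answer: str) -> float:
    """Toy scorer: fraction of golden-answer words found in the agent answer.

    This is an assumption made for the sketch, not how a real evaluator scores.
    """
    golden_tokens = set(golden_answer.lower().split())
    agent_tokens = set(agent_answer.lower().split())
    if not golden_tokens:
        return 0.0
    return len(golden_tokens & agent_tokens) / len(golden_tokens)


def bulk_evaluate(
    pairs: List[GoldenPair],
    agent: Callable[[str], str],
    scorer: Callable[[str, str], float] = token_overlap_score,
) -> List[Dict[str, object]]:
    """Run every golden question through the agent and score each answer."""
    rows = []
    for pair in pairs:
        agent_answer = agent(pair.question)
        rows.append(
            {
                "question": pair.question,
                "golden_answer": pair.golden_answer,
                "agent_answer": agent_answer,
                "score": scorer(agent_answer, pair.golden_answer),
            }
        )
    return rows


def fake_agent(question: str) -> str:
    """Stand-in for an AI Agent call; returns canned text only."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris",
    }
    return canned.get(question, "I do not know")


golden_set = [
    GoldenPair("What is the capital of France?", "Paris is the capital of France"),
    GoldenPair("How many days are in a week?", "Seven days"),
]

results = bulk_evaluate(golden_set, fake_agent)
for row in results:
    print(f"{row['score']:.2f}  {row['question']}")
```

Passing the scorer in as a parameter keeps the loop unchanged if the toy overlap metric is later swapped for a stricter or model-based comparison.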

Why are bulk runs and evaluations important?

Bulk runs and evaluations help you with:

When building your Gooey.AI workflows, you will often need to tweak the settings so that the responses stay on par with your expected answers and remain grounded and verifiable.

There are several components to test:

How can you do this at scale?

This is where Gooey.AI’s Bulk and Evaluation features shine!

Common terms in bulk and evaluation

