Module 8b: How to use the bulk runner?
1. Prepare Your Test Questions and Golden Answers
Create a spreadsheet with your test questions and golden answers.
Your sheet should have columns for:
question
golden answer
citation (if needed)
audio file, as a Google Drive link (if needed)
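If it helps to see the shape of such a sheet, here is a minimal Python sketch that writes one as a CSV. The column headers follow the list above; the file name, questions, answers, and citations are placeholders, so keep whichever headers you plan to map in step 5.

```python
import csv

# Placeholder rows; replace with your own test questions and expert-written golden answers.
rows = [
    {
        "question": "How do I reset my password?",
        "golden answer": "Go to Settings > Account > Reset password and follow the emailed link.",
        "citation": "Help Center: Account settings",
        "audio": "",  # optional Google Drive link to an audio file
    },
    {
        "question": "Which languages does the copilot support?",
        "golden answer": "English, Hindi, and Swahili.",
        "citation": "Product FAQ",
        "audio": "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view",
    },
]

with open("golden_qa.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "golden answer", "citation", "audio"])
    writer.writeheader()
    writer.writerows(rows)
```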

2. Create or Duplicate Your AI Agent
You need an AI Agent for each model or prompt you want to test.
To create a new AI Agent for a different model (for example, to test Gemini 2.5 Pro vs. GPT 4.1):
Go to your existing AI Agent, click "Update," then "Save as new" to duplicate it.
Choose the new model (such as Gemini 2.5).
Update the name (for example, "Gemini 2.5").
Click "Save".
3. Set Up the Bulk Run
Go to gooey.ai/bulk.
Link your spreadsheet containing the test questions and golden answers:
In the "Input data spreadsheet" section, click "Link" and paste your spreadsheet URL.
Click "Import."
Once imported, check that your questions and golden answers have loaded correctly.

4. Add Your AI Agents as Workflows
In the bulk runner, click "Add workflow."
Start typing the name of your AI Agent (for example, "marketing_gooey_support_bot") and select it.
Add each AI Agent you want to compare (for example, one for GPT 4.0, one for Gemini 2.5 Pro).

5. Configure the Input and Output Columns
Go to "Show all columns."
Set "Input prompt" to your question column (e.g., "question").
Make sure "Output text," "Run URL," and "Runtime" are checked; these columns give you each generated answer, a link to the individual run for debugging, and how long it took.

6. Enable Evaluation Workflow
In the "Evaluation workflows" section, enable "Copilot evaluator."
This compares each model's output against your golden answer and assigns it a score.
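The Copilot evaluator is itself a Gooey.AI workflow, so the scoring happens inside Gooey.AI rather than in your own code. Purely as a rough mental model of "compare the output to the golden answer and score it," here is a naive word-overlap similarity in Python; it is an illustration only and is not how the Copilot evaluator actually scores answers.

```python
import re

def overlap_score(output: str, golden: str) -> float:
    """Toy scorer: fraction of the golden answer's words that appear in the output.

    Illustration only; Gooey.AI's Copilot evaluator uses its own workflow.
    """
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    golden_words = words(golden)
    if not golden_words:
        return 0.0
    return len(golden_words & words(output)) / len(golden_words)

score = overlap_score(
    "Go to Settings, open Account, then click Reset password.",
    "Go to Settings > Account > Reset password and follow the emailed link.",
)
print(round(score, 2))  # 0.55: about half of the golden answer's words are covered
```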

7. Start the Bulk Run
Click "Run."
Gooey.AI will process each question through every selected Copilot/model.
For each question and Copilot, you get the generated answer, run URL, runtime, and more.
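If you prefer to script steps 3 through 7 instead of clicking through the UI, Gooey.AI workflows can generally be called over HTTP with an API key. The sketch below is assumption-heavy: the endpoint path, the payload field names (documents, run_urls, input_columns, output_columns, eval_urls), and the saved-run URLs are guesses modeled on Gooey.AI's general workflow API pattern and are not confirmed by this guide. Check the API tab on the bulk runner page for the exact request schema before relying on it.

```python
import os
import requests

# Endpoint and field names are assumptions; verify them against the bulk runner's API tab.
payload = {
    # Spreadsheet with your questions and golden answers (step 3).
    "documents": ["https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID"],
    # Saved-run URLs of the AI Agents you want to compare (step 4); placeholders here.
    "run_urls": [
        "https://gooey.ai/copilot/?example_id=AGENT_GPT",
        "https://gooey.ai/copilot/?example_id=AGENT_GEMINI",
    ],
    # Map spreadsheet columns to workflow inputs and outputs (step 5).
    "input_columns": {"question": "input_prompt"},
    "output_columns": {"output_text": "answer", "run_url": "run_url", "run_time": "runtime"},
    # Evaluation workflow to score answers against the golden answers (step 6).
    "eval_urls": ["https://gooey.ai/eval/?example_id=COPILOT_EVALUATOR"],
}

response = requests.post(
    "https://api.gooey.ai/v2/bulk-runner/",
    headers={"Authorization": f"bearer {os.environ['GOOEY_API_KEY']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # should reference the output results document once the run completes
```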
8. Review and Compare Results
In the results sheet:
Each row shows the question, the answer from each Copilot, the runtime, and the run URL.
At the end, you will see the evaluation scores for each model.
The system identifies which model performed the best for each question and overall.
9. Analyze Performance
Look at the evaluation scores (for example, 80%, 100%, 60%).
Higher scores mean answers closer to the expert-provided golden answer.
If a new model scores lower, review the answers and ratings to find areas for improvement.
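If you export the results sheet, a short script can summarize it for you. The file name and the score column names below are assumptions for illustration; rename them to match the headers in your actual export.

```python
import pandas as pd

# Hypothetical export of the bulk-run results sheet; the score column names
# ("gpt_4_score", "gemini_2_5_pro_score") are placeholders, not Gooey.AI's actual headers.
results = pd.read_csv("bulk_run_results.csv")
score_columns = ["gpt_4_score", "gemini_2_5_pro_score"]

# Average evaluation score per model across all questions.
print(results[score_columns].mean().sort_values(ascending=False))

# Which model scored highest on each individual question.
print(results[["question"]].assign(best_model=results[score_columns].idxmax(axis=1)))
```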
10. Repeat or Refine
You can rerun the evaluation after adjusting prompts, models, or questions.
Use the results to decide which model or prompt is best for your use case.