How to use the Bulk Runner?

1. Prepare Your Test Questions and Golden Answers

  • Create a spreadsheet with your test questions and golden answers.

  • Your sheet should have columns for the following (a minimal example layout is sketched after this list):

    • question

    • golden answer

    • citation (if needed)

    • audio file (as a Google Drive link), if needed
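A minimal sketch of what such a sheet could contain, assuming you assemble it as a CSV with pandas (you can just as easily type it straight into Google Sheets). The column names, questions, and links below are purely illustrative:

```python
import pandas as pd

# Illustrative rows only -- replace with your own test questions,
# expert-written golden answers, citations, and (optional) audio links.
rows = [
    {
        "question": "What is the recommended row spacing for maize?",
        "golden answer": "Example golden answer written by your domain expert.",
        "citation": "Extension handbook, ch. 3",
        "audio_url": "https://drive.google.com/file/d/<file-id>/view",  # optional
    },
    {
        "question": "How often should tomatoes be irrigated?",
        "golden answer": "Another expert-written reference answer.",
        "citation": "",
        "audio_url": "",
    },
]

pd.DataFrame(rows).to_csv("bulk_run_questions.csv", index=False)
```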

2. Create or Duplicate Your AI Agent

  • You need an AI Agent for each model or prompt you want to test.

  • To create a new AI Agent for a different model (for example, to test Gemini 2.5 Pro vs. GPT 4.1):

    • Go to your existing AI Agent, click "Update," then "Save as new" to duplicate it.

    • Choose the new model (such as Gemini 2.5).

    • Update the name (for example, "Gemini 2.5").

    • Click "Save".

3. Set Up the Bulk Run

  • Link your spreadsheet containing the test questions and golden answers:

    • In the "Input data spreadsheet" section, click "Link" and paste your spreadsheet URL.

    • Click "Import."

  • Once imported, check that your questions and golden answers have loaded correctly (a quick sanity-check sketch follows below).
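If you want to sanity-check the sheet before (or after) linking it, a quick pandas pass like the sketch below can confirm the expected columns exist and that no question or golden answer is blank. The file and column names are carried over from the illustrative example in step 1; adjust them to your own sheet:

```python
import pandas as pd

# Load the sheet locally (export your Google Sheet as CSV first).
df = pd.read_csv("bulk_run_questions.csv")

required = ["question", "golden answer"]
missing = [col for col in required if col not in df.columns]
if missing:
    raise ValueError(f"Sheet is missing required columns: {missing}")

# Flag rows with a blank question or golden answer.
blanks = df[df["question"].isna() | df["golden answer"].isna()]
print(f"{len(df)} rows loaded, {len(blanks)} with a blank question or golden answer")
```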

4. Add Your AI Agents as Workflows

  • In the bulk runner, click "Add workflow."

  • Start typing the name of your AI Agent (for example, "marketing_gooey_support_bot") and select it.

  • Add each AI Agent you want to compare (for example, one for GPT 4.1, one for Gemini 2.5 Pro).

5. Configure the Input and Output Columns

  • Go to "Show all columns."

  • Set "Input prompt" to your question column (e.g., "question").

  • Make sure "Output text," "Run URL," and "Runtime" are checked. They help you with results and debugging.

6. Enable Evaluation Workflow

  • In the "Evaluation workflows" section, enable "Copilot evaluator."

  • This will compare each model's output to your golden answer and score them.

Note: If you only want to run the bulk runner without evaluation, you can delete the evaluator.

7. Start the Bulk Run

  • Click "Run."

  • Gooey.AI will process each question through every selected Copilot/model.

  • For each question and Copilot, you get the generated answer, run URL, runtime, and more.

8. Review and Compare Results

  • In the results sheet:

    • Each row shows the question, the answer from each Copilot, the runtime, and the run URL.

    • At the end, you will see the evaluation scores for each model.

    • The system identifies which model performed best for each question and overall (an export-and-inspect sketch follows below).
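To work with the results outside the browser, you can export the results sheet and inspect it with pandas. The exact column headers depend on your workflow names and the output columns chosen in step 5, so the headers below are assumptions; check your own sheet for the real ones:

```python
import pandas as pd

# Export the results sheet as CSV first; headers below are assumed examples.
results = pd.read_csv("bulk_run_results.csv")

row = results.iloc[0]  # one question, answered by every agent
print(row["question"])
print("GPT 4.1:        ", row["Output text (gpt_4_1_support_bot)"])
print("Gemini 2.5 Pro: ", row["Output text (gemini_2_5_pro_support_bot)"])
print("Run URLs:", row["Run URL (gpt_4_1_support_bot)"], row["Run URL (gemini_2_5_pro_support_bot)"])
```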

9. Analyze Performance

  • Look at the evaluation scores (for example, 80%, 100%, 60%).

  • Higher scores mean the generated answers are closer to the expert-provided golden answers.

  • If a new model scores lower, review its answers and ratings to find areas for improvement (see the score-aggregation sketch below).
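To compare models across the whole question set rather than row by row, you can average the evaluator's scores per model, as in the sketch below. Again, the score column names are assumptions; use whatever headers the evaluator writes into your results sheet:

```python
import pandas as pd

results = pd.read_csv("bulk_run_results.csv")

# Assumed score columns written by the Copilot evaluator -- rename to match your sheet.
score_columns = {
    "GPT 4.1": "score (gpt_4_1_support_bot)",
    "Gemini 2.5 Pro": "score (gemini_2_5_pro_support_bot)",
}

for model, column in score_columns.items():
    print(f"{model}: average evaluation score {results[column].mean():.1f}")
```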

10. Repeat or Refine

  • You can rerun the evaluation after adjusting prompts, models, or questions.

  • Use the results to decide which model or prompt is best for your use case.

How to add Audio Input in Bulk Evaluations?
