How to use the Bulk Runner?

1. Prepare Your Test Questions and Golden Answers

  • Create a spreadsheet with your test questions and golden answers.

  • Your sheet should have columns for the following (a minimal example layout is sketched after this list):

    • question

    • golden answer

    • citation (if needed)

    • audio file (as a Google Drive link), if needed
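A minimal sketch of what such a sheet could contain, assuming you assemble it as a CSV with pandas (you can just as easily type it straight into Google Sheets). The column names, questions, and links below are purely illustrative:

```python
import pandas as pd

# Illustrative rows only -- replace with your own test questions,
# expert-written golden answers, citations, and (optional) audio links.
rows = [
    {
        "question": "What is the recommended row spacing for maize?",
        "golden answer": "Example golden answer written by your domain expert.",
        "citation": "Extension handbook, ch. 3",
        "audio_url": "https://drive.google.com/file/d/<file-id>/view",  # optional
    },
    {
        "question": "How often should tomatoes be irrigated?",
        "golden answer": "Another expert-written reference answer.",
        "citation": "",
        "audio_url": "",
    },
]

pd.DataFrame(rows).to_csv("bulk_run_questions.csv", index=False)
```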

2. Create or Duplicate Your AI Agent

  • You need an AI Agent for each model or prompt you want to test.

  • To create a new AI Agent for a different model (for example, to test Gemini 2.5 Pro vs. GPT 4.1):

    • Go to your existing AI Agent, click "Update," then "Save as new" to duplicate it.

    • Choose the new model (such as Gemini 2.5).

    • Update the name (for example, "Gemini 2.5").

    • Click "Save".

3. Set Up the Bulk Run

  • Link your spreadsheet containing the test questions and golden answers:

    • In the "Input data spreadsheet" section, click "Link" and paste your spreadsheet URL.

    • Click "Import."

  • Once imported, check that your questions and golden answers have loaded correctly (a quick sanity-check sketch follows below).
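If you want to sanity-check the sheet before (or after) linking it, a quick pandas pass like the sketch below can confirm the expected columns exist and that no question or golden answer is blank. The file and column names are carried over from the illustrative example in step 1; adjust them to your own sheet:

```python
import pandas as pd

# Load the sheet locally (export your Google Sheet as CSV first).
df = pd.read_csv("bulk_run_questions.csv")

required = ["question", "golden answer"]
missing = [col for col in required if col not in df.columns]
if missing:
    raise ValueError(f"Sheet is missing required columns: {missing}")

# Flag rows with a blank question or golden answer.
blanks = df[df["question"].isna() | df["golden answer"].isna()]
print(f"{len(df)} rows loaded, {len(blanks)} with a blank question or golden answer")
```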

4. Add Your AI Agents as Workflows

  • In the bulk runner, click "Add workflow."

  • Start typing the name of your AI Agent (for example, "marketing_gooey_support_bot") and select it.

  • Add each AI Agent you want to compare (for example, one for GPT 4.1, one for Gemini 2.5 Pro).

5. Configure the Input and Output Columns

  • Go to "Show all columns."

  • Set "Input prompt" to your question column (e.g., "question").

  • Make sure "Output text," "Run URL," and "Runtime" are checked. They help you with results and debugging.

6. Enable Evaluation Workflow

  • In the "Evaluation workflows" section, enable "Copilot evaluator."

  • This will compare each model's output to your golden answer and score them.

Note: If you only want to run the bulk runner without evaluation, you can delete the evaluator.

7. Start the Bulk Run

  • Click "Run."

  • Gooey.AI will process each question through every selected Copilot/model.

  • For each question and Copilot, you get the generated answer, run URL, runtime, and more.

8. Review and Compare Results

  • In the results sheet:

    • Each row shows the question, the answer from each Copilot, the runtime, and the run URL.

    • At the end, you will see the evaluation scores for each model.

    • The system identifies which model performed best for each question and overall (an export-and-inspect sketch follows below).
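To work with the results outside the browser, you can export the results sheet and inspect it with pandas. The exact column headers depend on your workflow names and the output columns chosen in step 5, so the headers below are assumptions; check your own sheet for the real ones:

```python
import pandas as pd

# Export the results sheet as CSV first; headers below are assumed examples.
results = pd.read_csv("bulk_run_results.csv")

row = results.iloc[0]  # one question, answered by every agent
print(row["question"])
print("GPT 4.1:        ", row["Output text (gpt_4_1_support_bot)"])
print("Gemini 2.5 Pro: ", row["Output text (gemini_2_5_pro_support_bot)"])
print("Run URLs:", row["Run URL (gpt_4_1_support_bot)"], row["Run URL (gemini_2_5_pro_support_bot)"])
```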

9. Analyze Performance

  • Look at the evaluation scores (for example, 80%, 100%, 60%).

  • Higher scores mean the generated answers are closer to the expert-provided golden answers.

  • If a new model scores lower, review its answers and ratings to find areas for improvement (see the score-aggregation sketch below).
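To compare models across the whole question set rather than row by row, you can average the evaluator's scores per model, as in the sketch below. Again, the score column names are assumptions; use whatever headers the evaluator writes into your results sheet:

```python
import pandas as pd

results = pd.read_csv("bulk_run_results.csv")

# Assumed score columns written by the Copilot evaluator -- rename to match your sheet.
score_columns = {
    "GPT 4.1": "score (gpt_4_1_support_bot)",
    "Gemini 2.5 Pro": "score (gemini_2_5_pro_support_bot)",
}

for model, column in score_columns.items():
    print(f"{model}: average evaluation score {results[column].mean():.1f}")
```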

10. Repeat or Refine

  • You can rerun the evaluation after adjusting prompts, models, or questions.

  • Use the results to decide which model or prompt is best for your use case.

How to add Audio Input in Bulk Evaluations?
