Module 8b: How to use the Bulk Runner?

1. Prepare Your Test Questions and Golden Answers

  • Create a spreadsheet with your test questions and golden answers.

  • Your sheet should have columns for:

    • question

    • golden answer

    • citation (if needed)

    • audio file (a Google Drive link), if needed
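
If you prefer to build this sheet with a script (for example, from an existing FAQ document), here is a minimal sketch using Python's csv module. The column names match the list above; the file name and the example row are placeholders, so replace them with your own data before uploading or linking the sheet.

```python
import csv

# Placeholder rows -- replace with your own test questions and expert answers.
rows = [
    {
        "question": "What is the recommended planting depth for maize?",
        "golden answer": "Plant maize seeds about 5 cm deep in moist soil.",
        "citation": "Agronomy Handbook, ch. 3",
        "audio file": "",  # optional Google Drive link to an audio question
    },
]

with open("bulk_run_test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "golden answer", "citation", "audio file"]
    )
    writer.writeheader()
    writer.writerows(rows)
```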

2. Create or Duplicate Your AI Agent

  • You need an AI Agent for each model or prompt you want to test.

  • To create a new AI Agent for a different model (for example, to test Gemini 2.5 Pro vs. GPT 4.1):

    • Go to your existing AI Agent, click "Update," then "Save as new" to duplicate it.

    • Choose the new model (such as Gemini 2.5).

    • Update the name (for example, "Gemini 2.5").

    • Click "Save".

3. Set Up the Bulk Run

  • Link your spreadsheet containing the test questions and golden answers:

    • In the "Input data spreadsheet" section, click "Link" and paste your spreadsheet URL.

    • Click "Import."

  • Once imported, check that your questions and golden answers have loaded correctly.

4. Add Your AI Agents as Workflows

  • In the bulk runner, click "Add workflow."

  • Start typing the name of your AI Agent (for example, "marketing_gooey_support_bot") and select it.

  • Add each AI Agent you want to compare (for example, one for GPT 4.0, one for Gemini 2.5 Pro).

5. Configure the Input and Output Columns

  • Go to "Show all columns."

  • Set "Input prompt" to your question column (e.g., "question").

  • Make sure "Output text," "Run URL," and "Runtime" are checked. They help you with results and debugging.

6. Enable Evaluation Workflow

  • In the "Evaluation workflows" section, enable "Copilot evaluator."

  • This will compare each model's output to your golden answer and score them.

If you only want to run the bulk runner without evaluation, you can delete the evaluator.

7. Start the Bulk Run

  • Click "Run."

  • Gooey.AI will process each question through every selected AI Agent/model.

  • For each question and AI Agent, you get the generated answer, run URL, runtime, and more.
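
Gooey.AI also exposes its workflows over an HTTP API, so the same bulk run can in principle be triggered from a script instead of the UI. The sketch below is a hypothetical example using Python's requests library: the endpoint path, the payload field names (documents, run_urls, input_columns, output_columns, eval_urls), and all URLs are assumptions, so check the API tab on your Bulk Runner page for the exact schema.

```python
import os
import requests

# Hypothetical payload -- verify every field name against the Bulk Runner API tab.
payload = {
    "documents": ["https://docs.google.com/spreadsheets/d/your-test-sheet"],  # input spreadsheet
    "run_urls": [  # the saved AI Agents (workflows) to compare
        "https://gooey.ai/copilot/?example_id=your-gpt-agent",
        "https://gooey.ai/copilot/?example_id=your-gemini-agent",
    ],
    "input_columns": {"question": "input_prompt"},     # map sheet column -> workflow input
    "output_columns": {"output_text": "Output text"},  # map workflow output -> sheet column
    "eval_urls": ["https://gooey.ai/bulk-eval/?example_id=copilot-evaluator"],
}

response = requests.post(
    "https://api.gooey.ai/v2/bulk-runner/",  # assumed endpoint; confirm in the API docs
    headers={"Authorization": f"Bearer {os.environ['GOOEY_API_KEY']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())
```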

8. Review and Compare Results

  • In the results sheet:

    • Each row shows the question, the answer from each AI Agent, the runtime, and the run URL.

    • At the end, you will see the evaluation scores for each model.

    • The system identifies which model performed the best for each question and overall.

9. Analyze Performance

  • Look at the evaluation scores (for example, 80%, 100%, 60%).

  • Higher scores mean the generated answers are closer to the expert-provided golden answers.

  • If a new model scores lower, review the answers and ratings to find areas for improvement.
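
If you export the results sheet as a CSV, a quick way to compare models is to average the evaluation score per workflow. Below is a minimal sketch with pandas; the file name and the column names ("workflow" and "score") are assumptions, so adjust them to whatever your exported sheet actually uses.

```python
import pandas as pd

# Assumed export of the bulk-run results; adjust the path and column names.
results = pd.read_csv("bulk_run_results.csv")

# Average evaluation score per workflow (model/prompt variant), highest first.
summary = (
    results.groupby("workflow")["score"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```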

10. Repeat or Refine

  • You can rerun the evaluation after adjusting prompts, models, or questions.

  • Use the results to decide which model or prompt is best for your use case.
