# How to use the Bulk Runner?

{% embed url="https://youtu.be/2_k3Zg4Z1Rg" %}

**1. Prepare Your Test Questions and Golden Answers**

* Create a spreadsheet with your test questions and golden answers.
* Your sheet should have columns for:
  * question
  * golden answer
  * citation (if needed)
  * audio file (a Google Drive link, if needed)

<figure><img src="https://2450152260-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FNWqgWAjD0VVJgjYDpsN5%2Fuploads%2FNkccGEpwsc4L9SBPBIfX%2FScreenshot%202025-05-21%20at%205.25.20%E2%80%AFPM.png?alt=media&#x26;token=40f00810-ab77-44e5-9528-431afba6ac18" alt=""><figcaption></figcaption></figure>
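As an illustration, a test sheet like the one above can also be generated programmatically before uploading it to Google Sheets. This sketch writes a CSV with the columns described in step 1; the file name and sample row are assumptions for illustration only, not values required by Gooey.AI:

```python
import csv

# Hypothetical test set; the column names mirror the sheet described above.
rows = [
    {
        "question": "What crops grow best in sandy soil?",
        "golden answer": "Root vegetables such as carrots and potatoes.",
        "citation": "https://example.com/soil-guide",
        "audio file": "",  # optional Google Drive link to an audio question
    },
]

# Write the test questions to a CSV that can be imported into a spreadsheet.
with open("test_questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "golden answer", "citation", "audio file"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

You can then upload the resulting file to Google Sheets and link it in step 3.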

**2. Create or Duplicate Your AI Agent**

* You need an AI Agent for each model or prompt you want to test.
* To create a new AI Agent for a different model (for example, to test Gemini 2.5 Pro vs. GPT 4.1):
  * Go to your existing AI Agent, click "Update," then "Save as new" to duplicate it.
  * Choose the new model (such as Gemini 2.5).
  * Update the name (for example, "Gemini 2.5").
  * Click "Save".

**3. Set Up the Bulk Run**

* Go to [gooey.ai/bulk](https://gooey.ai/bulk).
* Link your spreadsheet containing the test questions and golden answers:
  * In the "Input data spreadsheet" section, click "Link" and paste your spreadsheet URL.
  * Click "Import."
* Once imported, check that your questions and golden answers have loaded correctly.

<figure><img src="https://2450152260-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FNWqgWAjD0VVJgjYDpsN5%2Fuploads%2FqyHBDtzm1AEM55FLoNdw%2FScreenshot%202025-05-21%20at%205.19.38%E2%80%AFPM.png?alt=media&#x26;token=3d912c25-1838-45d5-a6d7-8ec669a53a47" alt=""><figcaption></figcaption></figure>

**4. Add Your AI Agents as Workflows**

* In the bulk runner, click "Add workflow."
* Start typing the name of your AI Agent (for example, "marketing\_gooey\_support\_bot") and select it.
* Add each AI Agent you want to compare (for example, one for GPT 4.0, one for Gemini 2.5 Pro).

<figure><img src="https://2450152260-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FNWqgWAjD0VVJgjYDpsN5%2Fuploads%2FZYhOHiaIHbOfwarQfDgj%2FScreenshot%202025-05-21%20at%205.19.00%E2%80%AFPM.png?alt=media&#x26;token=5ead66f6-a5eb-4537-97f8-da624220a857" alt=""><figcaption></figcaption></figure>

**5. Configure the Input and Output Columns**

* Go to "Show all columns."
* Set "Input prompt" to your question column (e.g., "question").
* Make sure "Output text," "Run URL," and "Runtime" are checked; these columns help you review results and debug individual runs.

<figure><img src="https://2450152260-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FNWqgWAjD0VVJgjYDpsN5%2Fuploads%2F9GoyaLs1EfNIpnvXVwel%2FScreenshot%202025-05-21%20at%205.20.05%E2%80%AFPM.png?alt=media&#x26;token=45a886d0-13bc-43ae-ba32-48f3adc83945" alt=""><figcaption></figcaption></figure>

**6. Enable Evaluation Workflow**

* In the "Evaluation workflows" section, enable "Copilot evaluator."
* This will compare each model's output to your golden answer and score it.

<figure><img src="https://2450152260-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FNWqgWAjD0VVJgjYDpsN5%2Fuploads%2FGRqXNbdFsbyRpyERPErY%2FScreenshot%202025-05-21%20at%205.30.31%E2%80%AFPM.png?alt=media&#x26;token=5826ec88-4a1f-4434-aaef-b677d196ef21" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
If you only want to run the bulk runner without evaluation, you can delete the evaluator.
{% endhint %}

**7. Start the Bulk Run**

* Click "Run."
* Gooey.AI will process each question through every selected AI Agent/model.
* For each question and AI Agent, you get the generated answer, run URL, runtime, and more.

**8. Review and Compare Results**

* In the results sheet:
  * Each row shows the question, the answer from each AI Agent, the runtime, and the run URL.
  * At the end, you will see the evaluation scores for each model.
  * The system identifies which model performed best for each question and overall.

**9. Analyze Performance**

* Look at the evaluation scores (for example, 80%, 100%, 60%).
* Higher scores indicate answers that are closer to the expert-provided golden answer.
* If a new model scores lower, review the answers and ratings to find areas for improvement.
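If you export the results sheet, the comparison in this step can also be done programmatically. This is a minimal sketch, assuming hypothetical column names ("model", "score") that may differ from your actual export; it averages each model's evaluation scores and picks the highest-scoring one:

```python
from collections import defaultdict

# Hypothetical exported results: one row per (question, model) pair
# with the evaluator's score for that answer.
results = [
    {"model": "GPT 4.1", "score": 80},
    {"model": "GPT 4.1", "score": 100},
    {"model": "Gemini 2.5", "score": 60},
    {"model": "Gemini 2.5", "score": 100},
]

# Group scores by model, then average them.
scores_by_model = defaultdict(list)
for row in results:
    scores_by_model[row["model"]].append(row["score"])

averages = {model: sum(s) / len(s) for model, s in scores_by_model.items()}
best = max(averages, key=averages.get)

print(averages)  # {'GPT 4.1': 90.0, 'Gemini 2.5': 80.0}
print(best)      # GPT 4.1
```

Averaging across all questions gives an overall ranking, while the per-row scores show which specific questions a lower-scoring model struggled with.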

**10. Repeat or Refine**

* You can rerun the evaluation after adjusting prompts, models, or questions.
* Use the results to decide which model or prompt is best for your use case.

### How to add Audio Input in Bulk Evaluations?

{% embed url="https://youtu.be/w1mKxxIWrRc" %}
