1. Overview

This vignette demonstrates how to use pairwiseLLM for Batch API workflows (server-side batching), which are distinct from the live API calls described in the Getting Started vignette.

Batch workflows are ideal for large-scale jobs because they:

  • Allow submitting thousands of pairs at once
  • Are often cheaper (e.g., discounted batch pricing on some providers)
  • Avoid client-side timeout and connection issues
  • Can be polled and resumed even if your local R session ends

Supported Batch API providers:

  • OpenAI
  • Anthropic
  • Google Gemini

Recommended approach: For multiple batches (e.g., templates × providers × models × forward/reverse), use llm_submit_pairs_multi_batch() to submit and llm_resume_multi_batches() to poll and download.

These helpers orchestrate the provider-specific pipelines without forcing you to write your own polling loops.

Note: Together.ai and Ollama do not currently support a native Batch API compatible with this workflow. For those providers, use the live API wrapper submit_llm_pairs() as described in the Getting Started vignette.

In this vignette, we will cover:

  • Designing a grid of provider/model/thinking/direction combinations
  • Submitting many batch jobs using the multi-batch helpers
  • Polling and resuming safely via on-disk registries
  • Producing per-run and merged results tables

Note: All heavy API calls in this vignette are set to eval = FALSE so that the vignette remains CRAN-safe. You can enable them in your own project.

For basic function usage, see the Getting Started companion vignette.

For prompt evaluation and positional-bias diagnostics, see the corresponding companion vignette.

2. Setup and API Keys

Required environment variables:

Provider    Environment variable
OpenAI      OPENAI_API_KEY
Anthropic   ANTHROPIC_API_KEY
Gemini      GEMINI_API_KEY

Check which are set:

check_llm_api_keys()
#> No LLM API keys are currently set for known backends:
#>   - OpenAI:         OPENAI_API_KEY
#>   - Anthropic:      ANTHROPIC_API_KEY
#>   - Google Gemini:  GEMINI_API_KEY
#>   - Together.ai:    TOGETHER_API_KEY
#> 
#> Use `usethis::edit_r_environ()` to add the keys persistently, e.g.:
#>   OPENAI_API_KEY    = "YOUR_OPENAI_KEY_HERE"
#>   ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_KEY_HERE"
#>   GEMINI_API_KEY    = "YOUR_GEMINI_KEY_HERE"
#>   TOGETHER_API_KEY  = "YOUR_TOGETHER_KEY_HERE"
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#>   <chr>     <chr>         <chr>             <lgl>  
#> 1 openai    OpenAI        OPENAI_API_KEY    FALSE  
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY FALSE  
#> 3 gemini    Google Gemini GEMINI_API_KEY    FALSE  
#> 4 together  Together.ai   TOGETHER_API_KEY  FALSE
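
The message above recommends storing keys in your .Renviron via usethis::edit_r_environ(). For a quick, session-only setup, you can also set them with base R's Sys.setenv() (placeholder values shown):

# Session-only alternative; replace the placeholders with real keys.
Sys.setenv(
  OPENAI_API_KEY    = "YOUR_OPENAI_KEY_HERE",
  ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_KEY_HERE",
  GEMINI_API_KEY    = "YOUR_GEMINI_KEY_HERE"
)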

3. Example Data and Prompt Template

We use the built-in writing samples and a single trait (overall_quality).

data("example_writing_samples", package = "pairwiseLLM")

td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#> 
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n      how clearly the writing is organized, and how effective the language and\n      conventions are."

Default prompt template:

tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#> 
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#> 
#> SAMPLES:
#> 
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#> 
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#> 
#> EVALUATION PROCESS (Mental Simulation):
#> 
#> 1.  **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the  ...

Construct a modest number of pairs to keep the example light:

set.seed(123)

pairs_all <- example_writing_samples |>
  make_pairs()

n_pairs <- min(40L, nrow(pairs_all))

pairs_forward <- pairs_all |>
  sample_pairs(n_pairs = n_pairs, seed = 123) |>
  randomize_pair_order(seed = 456)

pairs_reverse <- sample_reverse_pairs(
  pairs_forward,
  reverse_pct = 1.0,
  seed        = 789
)

get_pairs_for_direction <- function(direction = c("forward", "reverse")) {
  direction <- match.arg(direction)
  if (identical(direction, "forward")) {
    pairs_forward
  } else {
    pairs_reverse
  }
}
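
Because reverse_pct = 1.0, the reverse set should mirror the forward set one-to-one, so both directions contain the same number of comparisons; a quick check:

# With reverse_pct = 1.0, every forward pair should have a reversed
# counterpart, so the row counts match.
nrow(pairs_forward)
nrow(pairs_reverse)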

4. Designing the Batch Grid

Suppose we want to test several prompt templates across:

  • Anthropic models (with/without “thinking”)
  • OpenAI models (with/without “thinking” for specific models)
  • Gemini models (with “thinking” enabled)

Here we define a small grid:

anthropic_models <- c(
  "claude-sonnet-4-5",
  "claude-haiku-4-5",
  "claude-opus-4-5"
)

gemini_models <- c(
  "gemini-3-pro-preview"
)

openai_models <- c(
  "gpt-4.1",
  "gpt-4o",
  "gpt-5.1"
)

thinking_levels <- c("no_thinking", "with_thinking")
directions <- c("forward", "reverse")

anthropic_grid <- tidyr::expand_grid(
  provider  = "anthropic",
  model     = anthropic_models,
  thinking  = thinking_levels,
  direction = directions
)

gemini_grid <- tidyr::expand_grid(
  provider  = "gemini",
  model     = gemini_models,
  thinking  = "with_thinking",
  direction = directions
)

openai_grid <- tidyr::expand_grid(
  provider  = "openai",
  model     = openai_models,
  thinking  = thinking_levels,
  direction = directions
) |>
  # For example, only allow "with_thinking" for gpt-5.1
  dplyr::filter(model == "gpt-5.1" | thinking == "no_thinking")

batch_grid <- dplyr::bind_rows(
  anthropic_grid,
  gemini_grid,
  openai_grid
)

batch_grid
#> # A tibble: 22 × 4
#>    provider  model             thinking      direction
#>    <chr>     <chr>             <chr>         <chr>    
#>  1 anthropic claude-sonnet-4-5 no_thinking   forward  
#>  2 anthropic claude-sonnet-4-5 no_thinking   reverse  
#>  3 anthropic claude-sonnet-4-5 with_thinking forward  
#>  4 anthropic claude-sonnet-4-5 with_thinking reverse  
#>  5 anthropic claude-haiku-4-5  no_thinking   forward  
#>  6 anthropic claude-haiku-4-5  no_thinking   reverse  
#>  7 anthropic claude-haiku-4-5  with_thinking forward  
#>  8 anthropic claude-haiku-4-5  with_thinking reverse  
#>  9 anthropic claude-opus-4-5   no_thinking   forward  
#> 10 anthropic claude-opus-4-5   no_thinking   reverse  
#> # ℹ 12 more rows

We also imagine that multiple prompt templates have been registered. For simplicity, we reuse the same tmpl string for every template; in practice you would substitute different text:

templates_tbl <- tibble::tibble(
  template_id     = c("test1", "test2", "test3", "test4", "test5"),
  prompt_template = list(tmpl, tmpl, tmpl, tmpl, tmpl)
)

templates_tbl
#> # A tibble: 5 × 2
#>   template_id prompt_template
#>   <chr>       <list>         
#> 1 test1       <chr [1]>      
#> 2 test2       <chr [1]>      
#> 3 test3       <chr [1]>      
#> 4 test4       <chr [1]>      
#> 5 test5       <chr [1]>
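
One lightweight way to create a genuinely different template (a sketch; tmpl_alt and the replacement wording are hypothetical) is to edit the default string while keeping its placeholders and output instructions intact:

# Sketch: derive a variant of the default template by changing only its
# opening sentence; all placeholders and downstream instructions are kept.
tmpl_alt <- sub(
  "You are a debate adjudicator.",
  "You are an experienced writing assessor.",
  tmpl,
  fixed = TRUE
)

# It could then replace one of the entries above, e.g. for "test2":
templates_tbl$prompt_template[[2]] <- tmpl_alt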

5. Submitting Many Batches with the Multi‑Batch Helpers

The key idea is:

  • Each combination of (template_id, provider, model, thinking, direction) becomes a run
  • Each run writes its files into its own subdirectory (so file names never collide)
  • Within each run you can still split into multiple segments using batch_size or n_segments (see the sketch below)
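
To get a feel for the splitting logic before submitting anything, here is a purely conceptual sketch; the real segmentation is handled internally by llm_submit_pairs_multi_batch() via its batch_size / n_segments arguments:

# Conceptual sketch only: how 40 pairs would fall into segments of at most
# 15 pairs each.
demo_n     <- 40L
demo_size  <- 15L
segment_id <- ceiling(seq_len(demo_n) / demo_size)
table(segment_id)   # three segments: 15, 15, and 10 pairs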

5.1 Create a run plan and output directory

out_root <- "dev-output/advanced-multi-batch"
dir.create(out_root, recursive = TRUE, showWarnings = FALSE)

run_plan <- tidyr::crossing(
  templates_tbl |> tidyr::unnest(prompt_template),
  batch_grid
) |>
  dplyr::mutate(
    run_id = paste(template_id, provider, model, thinking, direction, sep = "__"),
    run_id = gsub("[^A-Za-z0-9_.-]+", "-", run_id),
    run_dir = file.path(out_root, run_id)
  )

run_plan |> dplyr::select(run_id, template_id, provider, model, thinking, direction, run_dir)
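
Because every template is crossed with every grid row, the plan contains 5 × 22 = 110 runs; a quick sanity check:

# 5 templates x 22 provider/model/thinking/direction rows = 110 runs
nrow(templates_tbl) * nrow(batch_grid)
nrow(run_plan)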

5.2 Submit all runs (no polling)

Below we submit each run using llm_submit_pairs_multi_batch(). This returns a jobs list and writes a jobs_registry.csv under each run directory (because write_registry = TRUE).

Provider-specific options can be forwarded via the ... argument. In the example below we:

  • Enable “thinking” output where applicable
  • Ask providers to include raw outputs in addition to parsed tags (helpful for debugging)

submit_one_run <- function(template_id, prompt_template, provider, model, thinking, direction, run_dir, ...) {
  # `...` absorbs any extra run_plan columns (e.g., run_id) passed by pmap()
  pairs_use   <- get_pairs_for_direction(direction)
  is_thinking <- identical(thinking, "with_thinking")

  # Provider-specific knobs (passed through via ...)
  extra_args <- list()

  if (identical(provider, "openai")) {
    # Only request thoughts for models that support them in this workflow
    extra_args$include_thoughts <- is_thinking && grepl("^gpt-5\\.1", model)
    extra_args$include_raw      <- TRUE
  } else if (identical(provider, "anthropic")) {
    extra_args$reasoning        <- if (is_thinking) "enabled" else "none"
    extra_args$include_thoughts <- is_thinking
    extra_args$include_raw      <- TRUE
    # Optional: set deterministic temperature when not using reasoning
    if (!is_thinking) extra_args$temperature <- 0
  } else if (identical(provider, "gemini")) {
    extra_args$include_thoughts <- TRUE
    extra_args$thinking_level   <- "low"   # example
    extra_args$include_raw      <- TRUE
  }

  message(
    "Submitting: ", template_id, " | ", provider, " / ", model,
    " / ", thinking, " / ", direction
  )

  # Split strategy:
  # - For real jobs, use batch_size (e.g., 500–5000) or n_segments (e.g., 10–50)
  # - Here we keep it simple and submit a single segment per run
  do.call(
    llm_submit_pairs_multi_batch,
    c(
      list(
        pairs             = pairs_use,
        backend           = provider,
        model             = model,
        trait_name        = td$name,
        trait_description = td$description,
        prompt_template   = prompt_template,
        n_segments        = 1L,
        output_dir        = run_dir,
        write_registry    = TRUE,
        verbose           = TRUE
      ),
      extra_args
    )
  )
}

run_results <- purrr::pmap(
  run_plan,
  submit_one_run
)

# Store a lightweight manifest so you can resume later without rebuilding run_plan
manifest <- run_plan |>
  dplyr::mutate(registry_path = file.path(run_dir, "jobs_registry.csv"))

manifest_path <- file.path(out_root, "run_manifest.csv")
readr::write_csv(manifest, manifest_path)

manifest_path

At this point, each run directory contains:

  • JSONL input/output placeholders (one per segment)
  • A jobs_registry.csv that records all batch IDs and file paths for that run

You can safely stop R or restart your machine after submission.
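
Before shutting down, you can confirm that a run's registry was written; a quick check using base R (file names inside the directory vary by provider):

# Sketch: confirm the registry exists for the first run before stopping R.
first_run <- run_plan$run_dir[[1]]
file.exists(file.path(first_run, "jobs_registry.csv"))
list.files(first_run)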

6. Polling, Downloading, and Parsing (Resumable)

To poll all runs, read the manifest and call llm_resume_multi_batches() for each run_dir. If you restart R, you can resume without keeping the jobs objects in memory by setting jobs = NULL and pointing to output_dir (the function will load jobs_registry.csv).

manifest_path <- file.path(out_root, "run_manifest.csv")
manifest <- readr::read_csv(manifest_path, show_col_types = FALSE)

poll_one_run <- function(run_dir) {
  llm_resume_multi_batches(
    jobs               = NULL,   # load from jobs_registry.csv in run_dir
    output_dir         = run_dir,
    interval_seconds   = 60,
    per_job_delay      = 2,
    write_results_csv  = TRUE,   # writes batch_XX_results.csv files
    write_registry     = TRUE,   # refreshes jobs_registry.csv with done flags
    keep_jsonl         = TRUE,
    verbose            = TRUE,
    write_combined_csv = TRUE,   # writes combined_results.csv inside run_dir
    combined_csv_path  = "combined_results.csv"
  )
}

polled <- purrr::map(manifest$run_dir, poll_one_run)

6.1 Building a single merged results table (all runs)

Each element of polled contains a combined tibble for that run (i.e., all segments bound together). We can attach run metadata (template/provider/model/thinking/direction) and then bind all runs into one master table.

combined_all <- purrr::map2_dfr(
  polled,
  seq_len(nrow(manifest)),
  function(res, i) {
    meta <- manifest[i, ]
    if (is.null(res$combined)) return(NULL)

    res$combined |>
      dplyr::mutate(
        template_id = meta$template_id,
        provider    = meta$provider,
        model       = meta$model,
        thinking    = meta$thinking,
        direction   = meta$direction,
        run_id      = meta$run_id
      )
  }
)

combined_path <- file.path(out_root, "combined_all_runs.csv")
readr::write_csv(combined_all, combined_path)

combined_path

7. Resuming After Interruption

Resuming interrupted jobs is straightforward:

  • Submission writes jobs_registry.csv under each run directory
  • Polling can be restarted at any time by calling llm_resume_multi_batches(jobs = NULL, output_dir = <run_dir>)
  • If you keep a run_manifest.csv with run_dir paths, resuming all runs is just a loop

Example: resume only unfinished runs (based on each run’s registry):

manifest <- readr::read_csv(file.path(out_root, "run_manifest.csv"), show_col_types = FALSE)

needs_poll <- function(run_dir) {
  reg_path <- file.path(run_dir, "jobs_registry.csv")
  if (!file.exists(reg_path)) return(FALSE)
  reg <- readr::read_csv(reg_path, show_col_types = FALSE)
  any(!as.logical(reg$done))
}

unfinished_dirs <- manifest$run_dir[vapply(manifest$run_dir, needs_poll, logical(1))]

polled <- purrr::map(unfinished_dirs, poll_one_run)

8. Next Steps

Once you have per-run results CSVs (e.g., one per template × model × thinking × direction), you can:

  • Compare prompt templates, providers, models, and thinking settings side by side
  • Contrast forward and reverse directions to check for positional bias (see the companion vignette on prompt evaluation and positional-bias diagnostics)
  • Merge everything into a single table (Section 6.1) for downstream analysis
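
As a quick first look at the merged table from Section 6.1, you can tabulate how many parsed comparisons each run produced, using only the metadata columns added there (a sketch):

# Sketch: number of parsed results per run, based on the metadata columns
# attached in Section 6.1.
combined_all |>
  dplyr::count(template_id, provider, model, thinking, direction,
               name = "n_results") |>
  dplyr::arrange(dplyr::desc(n_results))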

9. Citation

Mercer, S. (2025). Advanced: Submitting and polling multiple batches [R package vignette]. In pairwiseLLM: Pairwise comparison tools for large language model-based writing evaluation. https://doi.org/10.32614/CRAN.package.pairwiseLLM