Advanced: Submitting and Polling Multiple Batches
Source: vignettes/advanced-batch-workflows.Rmd
1. Overview
This vignette demonstrates how to use pairwiseLLM for Batch API workflows (server-side batching), which are distinct from the live API calls described in the Getting Started vignette.
Batch workflows are ideal for large-scale jobs because they:
- Allow submitting thousands of pairs at once
- Are often cheaper (e.g., discounted batch pricing on some providers)
- Avoid client-side timeout and connection issues
- Can be polled and resumed even if your local R session ends
Supported Batch API providers:
- OpenAI (batch pipeline: run_openai_batch_pipeline())
- Anthropic (batch pipeline: run_anthropic_batch_pipeline())
- Gemini (batch pipeline: run_gemini_batch_pipeline())
Recommended approach: For multiple batches (e.g., templates × providers × models × forward/reverse), use:
- llm_submit_pairs_multi_batch() to split and submit jobs (no polling; writes an optional registry CSV)
- llm_resume_multi_batches() to poll, download, and parse results (can resume from a registry on disk)
These helpers orchestrate the provider-specific pipelines without forcing you to write your own polling loops (a minimal sketch follows).
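As a quick orientation, here is a minimal sketch of that two-step workflow (not run). The objects my_pairs, td, and tmpl are placeholders built in Section 3, and the arguments mirror the fuller examples in Sections 5 and 6:
# Step 1: split the pairs and submit one or more server-side batch jobs.
jobs <- llm_submit_pairs_multi_batch(
  pairs = my_pairs,
  backend = "openai",
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  n_segments = 1L,
  output_dir = "dev-output/quick-sketch",
  write_registry = TRUE
)
# Step 2: poll, download, and parse; jobs = NULL would instead reload jobs_registry.csv from output_dir.
res <- llm_resume_multi_batches(
  jobs = jobs,
  output_dir = "dev-output/quick-sketch",
  write_results_csv = TRUE,
  write_registry = TRUE
)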
Note: Together.ai and Ollama do not currently support a native Batch API compatible with this workflow. For those providers, use the live API wrapper submit_llm_pairs() as described in the Getting Started vignette.
In this vignette, we will cover:
- Designing a grid of provider/model/thinking/direction combinations
- Submitting many batch jobs using the multi-batch helpers
- Polling and resuming safely via on-disk registries
- Producing per-run and merged results tables
Note: All heavy API calls in this vignette are set to eval = FALSE so that the vignette remains CRAN-safe. You can enable them in your own project.
For basic function usage, see the companion Getting Started vignette. For prompt evaluation and positional-bias diagnostics, see the companion prompt-evaluation vignette.
2. Setup and API Keys
Required environment variables:
| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Gemini | GEMINI_API_KEY |
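For quick interactive tests, you can also set keys for the current R session only with base R's Sys.setenv() (a sketch; replace the placeholder values with real keys, and prefer an .Renviron entry for persistent use):
# Session-only keys: not persisted, and lost when the R session ends.
Sys.setenv(
  OPENAI_API_KEY    = "YOUR_OPENAI_KEY_HERE",
  ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_KEY_HERE",
  GEMINI_API_KEY    = "YOUR_GEMINI_KEY_HERE"
)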
Check which are set:
check_llm_api_keys()
#> No LLM API keys are currently set for known backends:
#> - OpenAI: OPENAI_API_KEY
#> - Anthropic: ANTHROPIC_API_KEY
#> - Google Gemini: GEMINI_API_KEY
#> - Together.ai: TOGETHER_API_KEY
#>
#> Use `usethis::edit_r_environ()` to add the keys persistently, e.g.:
#> OPENAI_API_KEY = "YOUR_OPENAI_KEY_HERE"
#> ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_KEY_HERE"
#> GEMINI_API_KEY = "YOUR_GEMINI_KEY_HERE"
#> TOGETHER_API_KEY = "YOUR_TOGETHER_KEY_HERE"
#> # A tibble: 4 × 4
#> backend service env_var has_key
#> <chr> <chr> <chr> <lgl>
#> 1 openai OpenAI OPENAI_API_KEY FALSE
#> 2 anthropic Anthropic ANTHROPIC_API_KEY FALSE
#> 3 gemini Google Gemini GEMINI_API_KEY FALSE
#> 4 together Together.ai TOGETHER_API_KEY FALSE
3. Example Data and Prompt Template
We use the built-in writing samples and a single trait
(overall_quality).
data("example_writing_samples", package = "pairwiseLLM")
td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."Default prompt template:
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the ...
Construct a modest number of pairs to keep the example light:
set.seed(123)
pairs_all <- example_writing_samples |>
make_pairs()
n_pairs <- min(40L, nrow(pairs_all))
pairs_forward <- pairs_all |>
sample_pairs(n_pairs = n_pairs, seed = 123) |>
randomize_pair_order(seed = 456)
pairs_reverse <- sample_reverse_pairs(
pairs_forward,
reverse_pct = 1.0,
seed = 789
)
get_pairs_for_direction <- function(direction = c("forward", "reverse")) {
direction <- match.arg(direction)
if (identical(direction, "forward")) {
pairs_forward
} else {
pairs_reverse
}
}
4. Designing the Batch Grid
Suppose we want to test several prompt templates across:
- Anthropic models (with/without “thinking”)
- OpenAI models (with/without “thinking” for specific models)
- Gemini models (with “thinking” enabled)
Here we define a small grid:
anthropic_models <- c(
"claude-sonnet-4-5",
"claude-haiku-4-5",
"claude-opus-4-5"
)
gemini_models <- c(
"gemini-3-pro-preview"
)
openai_models <- c(
"gpt-4.1",
"gpt-4o",
"gpt-5.1"
)
thinking_levels <- c("no_thinking", "with_thinking")
directions <- c("forward", "reverse")
anthropic_grid <- tidyr::expand_grid(
provider = "anthropic",
model = anthropic_models,
thinking = thinking_levels,
direction = directions
)
gemini_grid <- tidyr::expand_grid(
provider = "gemini",
model = gemini_models,
thinking = "with_thinking",
direction = directions
)
openai_grid <- tidyr::expand_grid(
provider = "openai",
model = openai_models,
thinking = thinking_levels,
direction = directions
) |>
# For example, only allow "with_thinking" for gpt-5.1
dplyr::filter(model == "gpt-5.1" | thinking == "no_thinking")
batch_grid <- dplyr::bind_rows(
anthropic_grid,
gemini_grid,
openai_grid
)
batch_grid
#> # A tibble: 22 × 4
#> provider model thinking direction
#> <chr> <chr> <chr> <chr>
#> 1 anthropic claude-sonnet-4-5 no_thinking forward
#> 2 anthropic claude-sonnet-4-5 no_thinking reverse
#> 3 anthropic claude-sonnet-4-5 with_thinking forward
#> 4 anthropic claude-sonnet-4-5 with_thinking reverse
#> 5 anthropic claude-haiku-4-5 no_thinking forward
#> 6 anthropic claude-haiku-4-5 no_thinking reverse
#> 7 anthropic claude-haiku-4-5 with_thinking forward
#> 8 anthropic claude-haiku-4-5 with_thinking reverse
#> 9 anthropic claude-opus-4-5 no_thinking forward
#> 10 anthropic claude-opus-4-5 no_thinking reverse
#> # ℹ 12 more rows
We will also imagine multiple prompt templates have been registered.
For simplicity, we use the same tmpl string, but in
practice you would substitute different text:
templates_tbl <- tibble::tibble(
template_id = c("test1", "test2", "test3", "test4", "test5"),
prompt_template = list(tmpl, tmpl, tmpl, tmpl, tmpl)
)
templates_tbl
#> # A tibble: 5 × 2
#> template_id prompt_template
#> <chr> <list>
#> 1 test1 <chr [1]>
#> 2 test2 <chr [1]>
#> 3 test3 <chr [1]>
#> 4 test4 <chr [1]>
#> 5 test5 <chr [1]>
5. Submitting Many Batches with the Multi‑Batch Helpers
The key idea is:
- Each combination of (template_id, provider, model, thinking, direction) becomes a run
- Each run writes its files into its own subdirectory (so file names never collide)
- Within each run you can still split into multiple segments using batch_size or n_segments (see the sketch below)
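As a sketch of those splitting options (not run; pairs_forward, td, and tmpl come from Section 3, and the remaining arguments match Section 5.2):
# Split one run of 40 pairs into multiple server-side batches.
jobs_split <- llm_submit_pairs_multi_batch(
  pairs = pairs_forward,
  backend = "openai",
  model = "gpt-4.1",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  batch_size = 20L,   # about 20 pairs per segment; alternatively, n_segments = 2L
  output_dir = "dev-output/split-example",
  write_registry = TRUE
)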
5.1 Create a run plan and output directory
out_root <- "dev-output/advanced-multi-batch"
dir.create(out_root, recursive = TRUE, showWarnings = FALSE)
run_plan <- tidyr::crossing(
templates_tbl |> tidyr::unnest(prompt_template),
batch_grid
) |>
dplyr::mutate(
run_id = paste(template_id, provider, model, thinking, direction, sep = "__"),
run_id = gsub("[^A-Za-z0-9_.-]+", "-", run_id),
run_dir = file.path(out_root, run_id)
)
run_plan |> dplyr::select(run_id, template_id, provider, model, thinking, direction, run_dir)
5.2 Submit all runs (no polling)
Below we submit each run using
llm_submit_pairs_multi_batch(). This returns a
jobs list and writes a jobs_registry.csv under
each run directory (because write_registry = TRUE).
Provider-specific options can be forwarded via .... In
the example below we:
- Enable “thinking” output where applicable
- Ask providers to include raw outputs in addition to parsed tags (helpful for debugging)
submit_one_run <- function(template_id, prompt_template, provider, model, thinking, direction, run_dir) {
pairs_use <- get_pairs_for_direction(direction)
is_thinking <- identical(thinking, "with_thinking")
# Provider-specific knobs (passed through via ...)
extra_args <- list()
if (identical(provider, "openai")) {
# Only request thoughts for models that support them in this workflow
extra_args$include_thoughts <- is_thinking && grepl("^gpt-5\\.1", model)
extra_args$include_raw <- TRUE
} else if (identical(provider, "anthropic")) {
extra_args$reasoning <- if (is_thinking) "enabled" else "none"
extra_args$include_thoughts <- is_thinking
extra_args$include_raw <- TRUE
# Optional: set deterministic temperature when not using reasoning
if (!is_thinking) extra_args$temperature <- 0
} else if (identical(provider, "gemini")) {
extra_args$include_thoughts <- TRUE
extra_args$thinking_level <- "low" # example
extra_args$include_raw <- TRUE
}
message(
"Submitting: ", template_id, " | ", provider, " / ", model,
" / ", thinking, " / ", direction
)
# Split strategy:
# - For real jobs, use batch_size (e.g., 500–5000) or n_segments (e.g., 10–50)
# - Here we keep it simple and submit a single segment per run
do.call(
llm_submit_pairs_multi_batch,
c(
list(
pairs = pairs_use,
backend = provider,
model = model,
trait_name = td$name,
trait_description = td$description,
prompt_template = prompt_template,
n_segments = 1L,
output_dir = run_dir,
write_registry = TRUE,
verbose = TRUE
),
extra_args
)
)
}
run_results <- purrr::pmap(
dplyr::select(run_plan, -run_id), # drop run_id: submit_one_run() has no matching argument
submit_one_run
)
# Store a lightweight manifest so you can resume later without rebuilding run_plan
manifest <- run_plan |>
dplyr::mutate(registry_path = file.path(run_dir, "jobs_registry.csv"))
manifest_path <- file.path(out_root, "run_manifest.csv")
readr::write_csv(manifest, manifest_path)
manifest_path
At this point, each run directory contains:
- JSONL input/output placeholders (one per segment)
- A jobs_registry.csv that records all batch IDs and file paths for that run
You can safely stop R or restart your machine after submission.
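Before shutting down, you can sanity-check one submission by listing its run directory and reading its registry (a sketch; the registry columns are whatever the submit helper writes, such as batch IDs, file paths, and completion flags):
# Inspect the first run's directory and its jobs registry.
first_run_dir <- run_plan$run_dir[[1]]
list.files(first_run_dir)
readr::read_csv(file.path(first_run_dir, "jobs_registry.csv"), show_col_types = FALSE)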
6. Polling, Downloading, and Parsing (Resumable)
To poll all runs, read the manifest and call
llm_resume_multi_batches() for each run_dir.
If you restart R, you can resume without keeping the
jobs objects in memory by setting jobs = NULL
and pointing to output_dir (the function will load
jobs_registry.csv).
manifest_path <- file.path(out_root, "run_manifest.csv")
manifest <- readr::read_csv(manifest_path, show_col_types = FALSE)
poll_one_run <- function(run_dir) {
llm_resume_multi_batches(
jobs = NULL, # load from jobs_registry.csv in run_dir
output_dir = run_dir,
interval_seconds = 60,
per_job_delay = 2,
write_results_csv = TRUE, # writes batch_XX_results.csv files
write_registry = TRUE, # refreshes jobs_registry.csv with done flags
keep_jsonl = TRUE,
verbose = TRUE,
write_combined_csv = TRUE, # writes combined_results.csv inside run_dir
combined_csv_path = "combined_results.csv"
)
}
polled <- purrr::map(manifest$run_dir, poll_one_run)
6.1 Building a single merged results table (all runs)
Each element of polled contains a combined
tibble for that run (i.e., all segments bound together). We can attach
run metadata (template/provider/model/thinking/direction) and then bind
all runs into one master table.
combined_all <- purrr::map2_dfr(
polled,
seq_len(nrow(manifest)),
function(res, i) {
meta <- manifest[i, ]
if (is.null(res$combined)) return(NULL)
res$combined |>
dplyr::mutate(
template_id = meta$template_id,
provider = meta$provider,
model = meta$model,
thinking = meta$thinking,
direction = meta$direction,
run_id = meta$run_id
)
}
)
combined_path <- file.path(out_root, "combined_all_runs.csv")
readr::write_csv(combined_all, combined_path)
combined_path
7. Resuming After Interruption
Resuming jobs is possible:
- Submission writes jobs_registry.csv under each run directory
- Polling can be restarted at any time by calling llm_resume_multi_batches(jobs = NULL, output_dir = <run_dir>)
- If you keep a run_manifest.csv with run_dir paths, resuming all runs is just a loop
Example: resume only unfinished runs (based on each run’s registry):
manifest <- readr::read_csv(file.path(out_root, "run_manifest.csv"), show_col_types = FALSE)
needs_poll <- function(run_dir) {
reg_path <- file.path(run_dir, "jobs_registry.csv")
if (!file.exists(reg_path)) return(FALSE)
reg <- readr::read_csv(reg_path, show_col_types = FALSE)
any(!as.logical(reg$done))
}
unfinished_dirs <- manifest$run_dir[vapply(manifest$run_dir, needs_poll, logical(1))]
polled <- purrr::map(unfinished_dirs, poll_one_run)
8. Next Steps
Once you have per-run results CSVs (e.g., one per template × model × thinking × direction), you can:
- Compute reverse consistency with compute_reverse_consistency()
- Analyze positional bias with check_positional_bias()
- Aggregate results by provider/model/template using standard dplyr pipelines (sketched below)
- Fit Bradley–Terry models with build_bt_data() + fit_bt_model()
- Fit Elo models with fit_elo_model() (when EloChoice is installed)
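For example, a plain dplyr pipeline over combined_all can show how many parsed comparisons each run produced (a sketch that uses only the metadata columns attached in Section 6.1; the remaining columns depend on the provider output):
# Count parsed results per template × provider × model × thinking × direction.
combined_all |>
  dplyr::count(template_id, provider, model, thinking, direction, name = "n_results") |>
  dplyr::arrange(dplyr::desc(n_results))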
9. Citation
Mercer, S. (2025). Advanced: Submitting and polling multiple batches [R package vignette]. In pairwiseLLM: Pairwise comparison tools for large language model-based writing evaluation. https://doi.org/10.32614/CRAN.package.pairwiseLLM