
pairwiseLLM is an R package that provides a unified, extensible framework for generating, submitting, and modeling pairwise comparisons of writing quality using large language models (LLMs).

It includes:

  • Unified live and batch APIs across OpenAI, Anthropic, and Gemini
  • A prompt template registry with tested templates designed to reduce positional bias
  • Positional-bias diagnostics (forward vs reverse design)
  • Bradley–Terry (BT) and Elo modeling
  • Consistent data structures for all providers

Vignettes

Several vignettes are available to demonstrate functionality, covering:

  • basic function usage
  • advanced batch-processing workflows
  • prompt evaluation and positional-bias diagnostics


Supported Models

The following models are confirmed to work for pairwise comparisons:

Provider         Model                           Reasoning Mode?
OpenAI           gpt-5.2                         ✅ Yes
OpenAI           gpt-5.1                         ✅ Yes
OpenAI           gpt-4o                          ❌ No
OpenAI           gpt-4.1                         ❌ No
Anthropic        claude-sonnet-4-5               ✅ Yes
Anthropic        claude-haiku-4-5                ✅ Yes
Anthropic        claude-opus-4-5                 ✅ Yes
Google/Gemini    gemini-3-pro-preview            ✅ Yes
DeepSeek-AI [1]  DeepSeek-R1                     ✅ Yes
DeepSeek-AI [1]  DeepSeek-V3                     ❌ No
Moonshot-AI [1]  Kimi-K2-Instruct-0905           ❌ No
Qwen [1]         Qwen3-235B-A22B-Instruct-2507   ❌ No
Qwen [2]         qwen3:32b                       ✅ Yes
Google [2]       gemma3:27b                      ❌ No
Mistral [2]      mistral-small3.2:24b            ❌ No

[1] via the Together.ai API

[2] via Ollama on a local machine

Batch APIs are currently available for OpenAI, Anthropic, and Gemini only. Models accessed through Together.ai and Ollama are supported for live comparisons via submit_llm_pairs() / llm_compare_pair().

Backend     Live   Batch
openai      Yes    Yes
anthropic   Yes    Yes
gemini      Yes    Yes
together    Yes    No
ollama      Yes    No
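
For example, a locally served Ollama model can run live comparisons with the same submit_llm_pairs() call used for the hosted providers. A minimal sketch, assuming Ollama is running locally and gemma3:27b has already been pulled:

data("example_writing_samples")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(10, seed = 123) |>
  randomize_pair_order()

td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# Live comparisons against a local model served by Ollama (no API key needed)
res_local <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "ollama",
  model             = "gemma3:27b",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)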

Installation

Once the package is available on CRAN, install with:

install.packages("pairwiseLLM")

To install the development version from GitHub:

# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")

Load the package:

library(pairwiseLLM)

API Keys

pairwiseLLM reads keys only from environment variables.
Keys are never printed, never stored, and never written to disk.

You can verify which providers are available using the package's key-availability check, which returns a tibble showing whether R can see the required keys for:

  • OpenAI
  • Anthropic
  • Google Gemini
  • Together.ai

Setting API Keys

You may set keys temporarily for the current R session:

Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")

…but for normal use and for reproducible analyses, it is strongly recommended
to store them in your ~/.Renviron file.

Open your .Renviron file:

usethis::edit_r_environ()

Add the following lines:

OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"

Save the file, then restart R.

You can confirm that R now sees the keys.
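
As a quick base-R check (not a pairwiseLLM function, shown only as an illustration), you can test whether each variable is non-empty:

# TRUE means R can see the corresponding key
sapply(
  c("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "TOGETHER_API_KEY"),
  function(key) nzchar(Sys.getenv(key))
)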


Core Concepts

At a high level, pairwiseLLM workflows follow this structure:

  1. Writing samples – e.g., essays, constructed responses, short answers.
  2. Trait – a rating dimension such as “overall quality” or “organization”.
  3. Pairs – pairs of samples to be compared for that trait.
  4. Prompt template – instructions + placeholders for {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, {SAMPLE_2}.
  5. Backend – which provider/model to use (OpenAI, Anthropic, Gemini, Together, Ollama).
  6. Modeling – convert pairwise results to latent scores via BT or Elo.

The package provides helpers for each step.
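
A condensed sketch of the whole pipeline, using only helpers documented in the sections below (a live OpenAI run with gpt-4o is assumed here; any supported backend works the same way):

data("example_writing_samples")

# Steps 1 & 3: writing samples -> pairs
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(10, seed = 123) |>
  randomize_pair_order()

# Steps 2 & 4: trait and prompt template
td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# Step 5: submit to a backend
res <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

# Step 6: model latent quality scores
bt_fit <- fit_bt_model(build_bt_data(res$results))
summarize_bt_fit(bt_fit)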


Prompt Templates & Registry

pairwiseLLM includes:

  • A default template tested for positional bias
  • Support for multiple templates stored by name
  • User-defined templates via register_prompt_template()

View available templates

list_prompt_templates()
#> [1] "default" "test1"   "test2"   "test3"   "test4"   "test5"

Show the default template (truncated)

tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#> 
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#> 
#> SAMPLES:
#> 
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#> 
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#> 
#> EVALUATION PROCESS (Mental Simulation):
#> 
#> 1.  **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the  ...
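
The package substitutes the four placeholders from your trait description and writing samples when you submit pairs. Purely as an illustration of what the resolved prompt looks like (this is not how you normally use the package), the equivalent base-R substitution is:

td     <- trait_description("overall_quality")
prompt <- get_prompt_template("default")

# Fill in the placeholders by plain string replacement (illustration only)
prompt <- gsub("{TRAIT_NAME}",        td$name,        prompt, fixed = TRUE)
prompt <- gsub("{TRAIT_DESCRIPTION}", td$description, prompt, fixed = TRUE)
prompt <- gsub("{SAMPLE_1}", "Text of the first writing sample...",  prompt, fixed = TRUE)
prompt <- gsub("{SAMPLE_2}", "Text of the second writing sample...", prompt, fixed = TRUE)

cat(substr(prompt, 1, 300))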

Register your own template

register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…

{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.

SAMPLE 1:
{SAMPLE_1}

SAMPLE 2:
{SAMPLE_2}

<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")

Use it in a submission:

tmpl <- get_prompt_template("my_template")
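
Then pass it via the prompt_template argument. A sketch, assuming pairs and td have been built as in the Live Comparisons example below:

res <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl    # the custom "my_template" retrieved above
)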

Trait Descriptions

Traits define what “quality” means.

trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#> 
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n      how clearly the writing is organized, and how effective the language and\n      conventions are."

You can also provide custom traits:

trait_description(
  custom_name        = "Clarity",
  custom_description = "How understandable, coherent, and well structured the ideas are."
)

Live Comparisons

Use the unified live API for direct, synchronous calls. The submit_llm_pairs() function supports parallel processing and incremental output saving for all supported backends (OpenAI, Anthropic, Gemini, Together, and Ollama).

Key Features:

  • Parallel Execution: Set parallel = TRUE and workers = n to speed up processing.
  • Resume Capability: Provide a save_path (e.g., "results.csv"). The function writes results as they finish. If interrupted, running the command again will automatically skip pairs already present in the file.
  • Robust Output: Returns a list containing $results (successful comparisons) and $failed_pairs (errors), ensuring one bad request doesn’t crash the whole job.

Example:

data("example_writing_samples")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(10, seed = 123) |>
  randomize_pair_order()

td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# Run in parallel with incremental saving
res_list <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl,
  parallel          = TRUE,
  workers           = 4,
  save_path         = "live_results.csv"
)

# Inspect successes
head(res_list$results)

# Inspect failures (if any)
if (nrow(res_list$failed_pairs) > 0) {
  print(res_list$failed_pairs)
}
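
Because results are written to save_path as they complete, you can inspect the partial output at any time, and an interrupted run resumes by re-issuing the identical call; pairs already present in the file are skipped:

# Inspect partial results while (or after) the job runs
partial <- read.csv("live_results.csv")
nrow(partial)

# Resume after an interruption: the same call skips completed pairs
res_list <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl,
  parallel          = TRUE,
  workers           = 4,
  save_path         = "live_results.csv"
)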

Batch Comparisons

Large-scale runs use llm_submit_pairs_batch() to create a provider-side batch job and llm_download_batch_results() to retrieve the parsed results once the job completes.
Example:

batch <- llm_submit_pairs_batch(
  backend           = "anthropic",
  model             = "claude-sonnet-4-5",
  pairs             = pairs,
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

results <- llm_download_batch_results(batch)

Cost Estimation

Before running a large live or batch job, you can estimate token usage and cost with estimate_llm_pairs_cost(). The estimator:

  • Runs a small pilot on n_test pairs (live calls) to observe prompt_tokens and completion_tokens
  • Uses the pilot to calibrate a prompt-bytes → input-tokens model for the remaining pairs
  • Estimates output tokens for the remaining pairs using the pilot distribution and calculates costs (expected = 50th percentile; budget = 90th percentile)
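
The underlying cost calculation is ordinary per-million-token arithmetic. As a hand-check with made-up token counts (these numbers are illustrative, not real pilot output):

# Illustrative hand-check only; token counts below are invented
n_pairs        <- 200
input_tokens   <- 1500   # assumed prompt tokens per pair
output_tokens  <- 40     # assumed completion tokens per pair
batch_discount <- 0.5    # batch priced at 50% of live

n_pairs * batch_discount *
  (input_tokens  / 1e6 * 3.00 +   # cost_per_million_input
   output_tokens / 1e6 * 12.00)   # cost_per_million_output
#> [1] 0.498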

Example (batch pricing discount + budget cost)

data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 200, seed = 123) |>
  randomize_pair_order(seed = 456)

td   <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# Estimate cost using a small pilot run (live calls).
# If your provider offers discounted batch pricing, set batch_discount accordingly.
est <- estimate_llm_pairs_cost(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4.1",
  endpoint = "chat.completions",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  mode = "batch",
  batch_discount = 0.5,              # e.g., batch costs 50 percent of live
  n_test = 10,                       # number of paid pilot calls
  budget_quantile = 0.9,             # "budget" uses p90 output tokens
  cost_per_million_input = 3.00,     # set these to your provider pricing
  cost_per_million_output = 12.00
)

est
est$summary

Reuse pilot results (avoid paying twice)

By default, the estimator returns the pilot results and the remaining pairs. This lets you run the pilot once, then submit only the remaining pairs:

# Pairs not included in the pilot:
remaining_pairs <- est$remaining_pairs

# Submit remaining pairs using your preferred workflow (live):
res_live <- submit_llm_pairs(remaining_pairs, backend = "openai", model = "gpt-4.1", ...)

# For batch:
batch <- llm_submit_pairs_batch(
  backend           = "openai",
  model             = "gpt-4.1",
  pairs             = remaining_pairs,
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

results <- llm_download_batch_results(batch)
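
If you want the pilot comparisons included in the final dataset, bind them to the downloaded results before modeling. The component holding the pilot results is assumed here to be named est$pilot_results; check names(est) for the exact name in your version:

# `pilot_results` is a hypothetical component name; inspect names(est) first
all_results <- rbind(est$pilot_results, results)

bt_fit <- fit_bt_model(build_bt_data(all_results))
summarize_bt_fit(bt_fit)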

Multi‑Batch Jobs

For very large jobs or when you need to restart polling after an interruption, pairwiseLLM provides two convenience helpers that wrap the low-level batch APIs:

  • llm_submit_pairs_multi_batch() — divides a table of pairwise comparisons into multiple batch jobs, uploads the input JSONL files, creates the batches, and optionally writes a registry CSV containing all batch IDs and file paths. You can split by specifying either n_segments (number of jobs) or batch_size (maximum number of pairs per job).
  • llm_resume_multi_batches() — polls all unfinished batches, downloads and parses the results as soon as each job completes, and optionally writes per‑job result CSVs and a single combined CSV with the merged results.

Use these helpers when your dataset is large or if you anticipate having to pause and resume the job.

Example: splitting and resuming

data("example_writing_samples", package = "pairwiseLLM")

# construct 100 pairs and a trait description
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 100, seed = 123) |>
  randomize_pair_order(seed = 456)

td   <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 1. Submit the pairs as 10 separate batches and write a registry CSV to disk.
multi_job <- llm_submit_pairs_multi_batch(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-5.2",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl,
  n_segments        = 10,
  output_dir        = "directory_name/",
  write_registry    = TRUE,
  include_thoughts  = TRUE
)

# 2. Later (or in a new session), resume polling and download results.
res <- llm_resume_multi_batches(
  jobs               = multi_job$jobs,
  interval_seconds   = 60,
  write_results_csv  = TRUE,
  write_combined_csv = TRUE,
  keep_jsonl         = FALSE
)

head(res$combined)

The registry CSV contains all batch IDs and file paths, allowing you to resume polling with llm_resume_multi_batches() even if the R session is interrupted.


Positional Bias Testing

LLMs often show a first-position or second-position bias.
pairwiseLLM includes explicit tools for testing this.

Typical workflow

pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)

Submit:

# Submit forward pairs
out_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)

# Submit reverse pairs
out_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)

Compute bias:

cons <- compute_reverse_consistency(out_fwd$results, out_rev$results)
bias <- check_positional_bias(cons)

cons$summary
bias$summary

Positional-bias tested templates

Five included templates have been tested across different backend providers. Complete details are presented in a vignette: vignette("prompt-template-bias")


Bradley–Terry & Elo Modeling

Bradley–Terry (BT)

# res_list: output from submit_llm_pairs() 
bt_data <- build_bt_data(res_list$results)
bt_fit <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)

Elo Modeling

# res_list: output from submit_llm_pairs() 
elo_data <- build_elo_data(res_list$results)
elo_fit <- fit_elo_model(elo_data, runs = 5)

elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted

Live vs Batch Summary

Workflow   Use Case                    Functions
Live       small or interactive runs   submit_llm_pairs, llm_compare_pair
Batch      large jobs, cost control    llm_submit_pairs_batch, llm_download_batch_results

Contributing

Contributions to pairwiseLLM are very welcome!

  • Bug reports (with reproducible examples when possible)
  • Feature requests, ideas, and discussion
  • Pull requests improving:
    • functionality
    • documentation
    • examples / vignettes
    • test coverage
  • Backend integrations (e.g., additional LLM providers or local inference engines)
  • Modeling extensions

Reporting issues

If you encounter a problem:

  1. Run:

    devtools::session_info()
  2. Include:

    • reproducible code
    • the error message
    • the model/backend involved
    • your operating system
  3. Open an issue at:
    https://github.com/shmercer/pairwiseLLM/issues


License

MIT License. See LICENSE.


Package Author and Maintainer

S. H. Mercer (https://github.com/shmercer)

Citation

Mercer, S. H. (2025). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.2.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM