Estimate LLM token usage and cost for a set of pairwise comparisons

Estimate total token usage and cost for running a large set of pairwise comparisons by:

running a small pilot on n_test pairs (live calls) to observe prompt_tokens and completion_tokens, and
using the pilot to calibrate a prompt-bytes-to-input-token model for the remaining pairs, and
prorating output tokens for the remaining pairs from the pilot distribution.

Usage

estimate_llm_pairs_cost(
  pairs,
  model,
  trait_name,
  trait_description,
  prompt_template = set_prompt_template(),
  backend = c("openai", "anthropic", "gemini", "together"),
  endpoint = c("chat.completions", "responses"),
  mode = c("live", "batch"),
  n_test = 25,
  test_strategy = c("stratified_prompt_bytes", "random", "first"),
  seed = NULL,
  cost_per_million_input,
  cost_per_million_output,
  batch_discount = 1,
  budget_quantile = 0.9,
  return_test_results = TRUE,
  return_remaining_pairs = TRUE,
  ...
)

Arguments

pairs: Tibble or data frame with at least columns ID1, text1, ID2, text2. Typically created by make_pairs, sample_pairs, and randomize_pair_order.
model: Model name to use for the pilot run (and for the target job).
trait_name: Short label for the trait (for example "Overall Quality").
trait_description: Full-text description of the trait or rubric.
prompt_template: Prompt template string, typically from set_prompt_template.
backend: Backend for the pilot run; one of "openai", "anthropic", "gemini", or "together".
endpoint: OpenAI endpoint; one of "chat.completions" or "responses". Ignored for other backends.
mode: Target execution mode for the full job; one of "live" or "batch". The pilot is always run live. If mode = "batch", batch_discount is applied to the estimated cost for the remaining (non-pilot) pairs.
n_test: Number of pilot pairs to run live. Defaults to 25 or fewer if fewer pairs are supplied.
test_strategy: Strategy for selecting pilot pairs: "stratified_prompt_bytes" (default), "random", or "first".
seed: Optional integer seed used for pilot sampling when test_strategy is not "first".
cost_per_million_input: Cost per one million input tokens (prompt tokens), in your currency of choice.
cost_per_million_output: Cost per one million output tokens (completion tokens). Reasoning/thinking tokens are treated as output.
batch_discount: Numeric scalar multiplier applied to the estimated cost for the remaining pairs when mode = "batch". For example, if batch pricing is 50 percent of live pricing, use batch_discount = 0.5.
budget_quantile: Quantile used for the "budget" output-token estimate for remaining pairs. Defaults to 0.9 (p90).
return_test_results: Logical; if TRUE, include pilot results in the returned object so you can reuse them and avoid paying twice.
return_remaining_pairs: Logical; if TRUE, include the remaining pairs (excluding pilot pairs) in the returned object.
...: Additional arguments forwarded to submit_llm_pairs for the pilot run (for example api_key, reasoning, include_thoughts, max_tokens, etc.).

Value

An object of class "pairwiseLLM_cost_estimate", a list with:

summary: A one-row tibble with expected and budget token and cost estimates (and pilot usage).
calibration: A list describing the input-token calibration (coefficients and fit diagnostics).
test_pairs: The pilot pair subset.
pilot: Pilot results (when return_test_results = TRUE).
remaining_pairs: Remaining pairs (when return_remaining_pairs = TRUE).

Details

The estimator does not require a provider tokenizer. Input tokens are estimated from the byte length of the fully constructed prompt and calibrated on the pilot's observed prompt_tokens.

Examples

if (FALSE) { # \dontrun{
# Requires an API key and internet access.
data("example_writing_samples", package = "pairwiseLLM")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 50, seed = 123)

td <- trait_description("overall_quality")
tmpl <- set_prompt_template()

est <- estimate_llm_pairs_cost(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4.1",
  endpoint = "chat.completions",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl,
  mode = "batch",
  batch_discount = 0.5,
  n_test = 10,
  cost_per_million_input = 0.15,
  cost_per_million_output = 0.60
)

est
est$summary

# Reuse pilot results and run only remaining pairs:
remaining <- est$remaining_pairs
} # }