Execute stepwise adaptive ranking with a user-supplied judge.
Usage
adaptive_rank_run_live(
state,
judge,
n_steps = 1L,
fit_fn = NULL,
adaptive_config = NULL,
btl_config = NULL,
session_dir = NULL,
persist_item_log = NULL,
progress = c("all", "refits", "steps", "none"),
progress_redraw_every = 10L,
progress_show_events = TRUE,
progress_errors = TRUE,
...
)
Arguments
- state
An adaptive state object created by adaptive_rank_start().
- judge
A function called as judge(A, B, state, ...) that returns a list with is_valid = TRUE and Y in 0/1, or is_valid = FALSE with invalid_reason.
- n_steps
Maximum number of attempted adaptive steps to execute in this call. The run may terminate earlier if candidate starvation is encountered or if BTL stopping criteria are met at a refit. Each attempted step counts toward this budget, including invalid judge responses.
- fit_fn
Optional BTL fit function for deterministic testing; defaults to default_btl_fit_fn() when a refit is due.
- adaptive_config
Optional named list overriding adaptive controller behavior. Supported fields: global_identified_reliability_min, global_identified_rank_corr_min, p_long_low, p_long_high, long_taper_mult, long_frac_floor, mid_bonus_frac, explore_taper_mult, boundary_k, boundary_window, boundary_frac, p_star_override_margin, and star_override_budget_per_round. Unknown fields and invalid values abort with an actionable error.
- btl_config
Optional named list overriding BTL refit cadence, stopping thresholds, and selected round-log diagnostics. Supported fields:
  - refit_pairs_target: Minimum new committed comparisons required before the next BTL refit.
  - model_variant: BTL MCMC variant: "btl", "btl_e", "btl_b", or "btl_e_b".
  - ess_bulk_min: Minimum bulk ESS required for diagnostics to pass.
  - ess_bulk_min_near_stop: Stricter ESS requirement when a run is close to stopping.
  - max_rhat: Maximum allowed split-\(\hat{R}\) diagnostic value.
  - divergences_max: Maximum allowed divergent transitions.
  - eap_reliability_min: Minimum EAP reliability to allow stopping.
  - stability_lag: Lag (in refits) used for stability checks.
  - theta_corr_min: Minimum lagged correlation of posterior means.
  - theta_sd_rel_change_max: Maximum relative change in posterior SD allowed by stability checks.
  - rank_spearman_min: Minimum lagged Spearman rank correlation.
  - near_tie_p_low, near_tie_p_high: Probability band used only for near-tie diagnostics in round logging (not used for stopping decisions).
Defaults are resolved from the current item count, then merged with user overrides.
- session_dir
Optional directory for saving session artifacts.
- persist_item_log
Logical; when TRUE, write per-refit item logs to disk.
- progress
Progress output: "all", "refits", "steps", or "none".
- progress_redraw_every
Redraw progress bar every N steps.
- progress_show_events
Logical; when TRUE, print notable step events.
- progress_errors
Logical; when TRUE, include invalid-step events.
- ...
Additional arguments passed through to
judge().
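The judge contract described above can be made robust to backend failures by wrapping a comparison function so that R errors become invalid responses instead of aborting the run. The following is only a sketch: make_safe_judge() and the reason strings "judge_error" and "malformed_response" are hypothetical names chosen for illustration, not part of the package.

```r
# Hypothetical helper (not part of the package): wrap a judge so that any R
# error, or a response that violates the documented contract, is returned as
# an invalid response.
make_safe_judge <- function(judge_fn) {
  function(A, B, state, ...) {
    tryCatch(
      {
        out <- judge_fn(A, B, state, ...)
        # A valid response needs is_valid = TRUE and a single Y in 0/1.
        if (isTRUE(out$is_valid) && length(out$Y) == 1L && out$Y %in% c(0L, 1L)) {
          list(is_valid = TRUE, Y = as.integer(out$Y), invalid_reason = NA_character_)
        } else {
          reason <- out$invalid_reason
          if (is.null(reason) || length(reason) != 1L || is.na(reason)) {
            reason <- "malformed_response"
          }
          list(is_valid = FALSE, Y = NA_integer_, invalid_reason = reason)
        }
      },
      error = function(e) {
        list(is_valid = FALSE, Y = NA_integer_, invalid_reason = "judge_error")
      }
    )
  }
}

# Toy judge that fails for one particular item id.
flaky <- function(A, B, state, ...) {
  if (A$item_id[[1]] == "bad") stop("backend unavailable")
  list(is_valid = TRUE, Y = 1L, invalid_reason = NA_character_)
}

safe <- make_safe_judge(flaky)
ok  <- safe(data.frame(item_id = "a"), data.frame(item_id = "b"), state = NULL)
bad <- safe(data.frame(item_id = "bad"), data.frame(item_id = "b"), state = NULL)
```

A wrapped judge of this kind could be passed as the judge argument in place of the raw function; the failed call would then be logged as an invalid step and still count toward n_steps.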
Value
An updated adaptive_state. The returned state includes
appended step_log rows for attempted steps and, when refits occur,
appended round_log and item_log entries.
Details
Each iteration attempts at most one pair evaluation ("one-pair step"), then
applies transactional updates if and only if the judge response is valid.
Invalid responses produce a logged step with
pair_id = NA and must not update committed-comparison state.
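The accounting rule above (every attempted step is logged, only valid responses commit a pair) can be illustrated on a mock log. The data frame below is a hand-made stand-in with invented values; a real step_log has many more columns, and the sketch assumes that only invalid responses leave pair_id missing.

```r
# Mock step log (illustrative values only): pair_id is NA for steps whose
# judge response was invalid.
mock_step_log <- data.frame(
  step_id = 1:5,
  pair_id = c(1L, 2L, NA, 3L, NA)
)

steps_attempted <- nrow(mock_step_log)                 # all logged steps
committed_pairs <- sum(!is.na(mock_step_log$pair_id))  # valid responses only
invalid_steps   <- sum(is.na(mock_step_log$pair_id))   # logged, not committed
```

In this mock run, five steps of budget were consumed but only three comparisons were committed.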
Pair selection is TrueSkill-based and does not use BTL posterior draws.
Utility is based on
$$U_0 = p_{ij}(1 - p_{ij})$$ with exploration/exploitation routing and
fallback handling recorded in step_log.
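As a quick numerical illustration of the base utility, the sketch below evaluates \(U_0\) for a few win probabilities. The probabilities are illustrative inputs only; how the package derives \(p_{ij}\) from TrueSkill ratings is not shown here.

```r
# Base utility from the text: U_0 = p * (1 - p), where p is the predicted
# probability that item i beats item j.
u0 <- function(p) p * (1 - p)

p_ij <- c(0.50, 0.60, 0.75, 0.90, 0.99)  # illustrative probabilities only
utility_table <- data.frame(p_ij = p_ij, U_0 = u0(p_ij))
utility_table
```

The function is symmetric around 0.5 and peaks there, so pairs whose outcome is most uncertain receive the highest base utility, while near-certain pairs receive almost none.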
Round scheduling uses stage-specific admissibility:
rolling-anchor links compare one anchor and one non-anchor endpoint;
long/mid links exclude anchor endpoints and enforce stratum-distance bounds;
local-link routing admits same-stratum pairs and anchor-involving pairs within local stage bounds.
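The first two admissibility rules above can be sketched as a simple predicate. This is an illustration under assumptions, not the package's implementation: the function name, the stage labels ("anchor_link", "long_link", "mid_link"), and the min_dist/max_dist bounds are hypothetical, and local-link routing is omitted because its bounds are not specified here.

```r
# Hypothetical predicate for the rolling-anchor and long/mid rules.
# `min_dist` and `max_dist` are placeholders, not package defaults.
is_admissible <- function(stage, is_anchor_i, is_anchor_j, dist_stratum,
                          min_dist = 1L, max_dist = Inf) {
  switch(stage,
    # Rolling-anchor link: exactly one endpoint is an anchor.
    anchor_link = xor(is_anchor_i, is_anchor_j),
    # Long/mid links: no anchor endpoints, stratum distance inside the bounds.
    long_link = ,
    mid_link = !is_anchor_i && !is_anchor_j &&
      dist_stratum >= min_dist && dist_stratum <= max_dist,
    stop("unknown stage: ", stage)
  )
}

is_admissible("anchor_link", TRUE, FALSE, dist_stratum = 2L)
is_admissible("long_link", FALSE, FALSE, dist_stratum = 3L,
              min_dist = 2L, max_dist = 4L)
```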
Exposure and repeat handling are soft, stage-local constraints:
under-represented exploration uses the degree set deg <= D_min + 1, while
repeat-pressure gating uses bottom-quantile recent_deg (default quantile
0.25) and per-endpoint repeat-slot accounting against
repeat_in_round_budget.
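The two sets named above can be computed on toy data as follows. This sketch assumes that D_min is the minimum degree across items and that the bottom quantile is taken with R's default quantile definition; the degree vectors are invented, and how the controller acts on these sets is not shown.

```r
# Invented degree counts for five items.
deg        <- c(a = 3L, b = 1L, c = 2L, d = 5L, e = 1L)  # cumulative degree
recent_deg <- c(a = 2L, b = 0L, c = 1L, d = 4L, e = 1L)  # recent degree

# Under-represented set: degree within one of the minimum (deg <= D_min + 1).
d_min <- min(deg)
under_represented <- names(deg)[deg <= d_min + 1L]

# Bottom-quantile recent degree at the default quantile of 0.25.
cutoff <- stats::quantile(recent_deg, probs = 0.25, names = FALSE)
low_recent <- names(recent_deg)[recent_deg <= cutoff]
```

With these toy values, items b, c, and e fall in both sets, while the heavily compared item d falls in neither.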
Top-band defaults for stratum construction are
top_band_pct = 0.10 and top_band_bins = 5, with top-band size
ceiling(top_band_pct * N).
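The top-band size formula can be checked directly for a few item counts. The helper name top_band_size() is hypothetical and used only for this illustration.

```r
# Top-band size as stated in the text: ceiling(top_band_pct * N).
top_band_size <- function(n_items, top_band_pct = 0.10) {
  as.integer(ceiling(top_band_pct * n_items))
}

sizes <- vapply(c(8, 12, 100), top_band_size, integer(1))
sizes
```

With the default top_band_pct, the 8-item offline example has a top band of one item and the 12-item live example has a top band of two, because the ceiling rounds any fractional item up.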
Bayesian BTL refits are triggered on a step-based cadence and evaluated against
diagnostic gates (including ESS thresholds), reliability, and lagged
stability criteria. Refit-level outcomes are
appended to round_log; per-item posterior summaries are appended to
item_log. Controller behavior can change after refits via
identifiability-gated settings in adaptive_config; those controls
affect pair routing and quotas, while BTL remains inference-only.
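To make the gating logic concrete, the sketch below checks a set of refit diagnostics against btl_config-style thresholds. It is an assumption-based illustration: the functions and the structure of the diag list are hypothetical, the threshold names follow the btl_config fields, and ess_bulk_min_near_stop and theta_sd_rel_change_max are left out for brevity.

```r
# Hypothetical diagnostics gate: sampler diagnostics must pass first.
diagnostics_pass <- function(diag, cfg) {
  diag$ess_bulk >= cfg$ess_bulk_min &&
    diag$max_rhat <= cfg$max_rhat &&
    diag$divergences <= cfg$divergences_max
}

# Hypothetical stop check: diagnostics, reliability, and lagged stability.
stop_allowed <- function(diag, cfg) {
  diagnostics_pass(diag, cfg) &&
    diag$eap_reliability >= cfg$eap_reliability_min &&
    diag$theta_corr >= cfg$theta_corr_min &&
    diag$rank_spearman >= cfg$rank_spearman_min
}

cfg <- list(
  ess_bulk_min = 500, max_rhat = 1.01, divergences_max = 0L,
  eap_reliability_min = 0.92, theta_corr_min = 0.97, rank_spearman_min = 0.97
)

# Invented diagnostics for an early and a later refit.
diag_early <- list(ess_bulk = 650, max_rhat = 1.004, divergences = 0L,
                   eap_reliability = 0.88, theta_corr = 0.95, rank_spearman = 0.96)
diag_late  <- list(ess_bulk = 900, max_rhat = 1.002, divergences = 0L,
                   eap_reliability = 0.94, theta_corr = 0.985, rank_spearman = 0.98)

early_stop <- stop_allowed(diag_early, cfg)
late_stop  <- stop_allowed(diag_late, cfg)
```

In this toy case the early refit passes the sampler diagnostics but fails the reliability and stability thresholds, so the run would continue; the later refit satisfies every threshold shown.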
Examples
# ------------------------------------------------------------------
# Offline end-to-end workflow (fast, deterministic, CRAN-safe)
# ------------------------------------------------------------------
data("example_writing_samples", package = "pairwiseLLM")
items <- dplyr::rename(
  example_writing_samples[1:8, c("ID", "text", "quality_score")],
  item_id = ID
)
# Use the package defaults for trait and prompt template.
trait <- trait_description("overall_quality")
prompt_template <- set_prompt_template()
# Deterministic local judge based on fixture quality scores.
sim_judge <- function(A, B, state, ...) {
  y <- as.integer(A$quality_score[[1]] >= B$quality_score[[1]])
  list(is_valid = TRUE, Y = y, invalid_reason = NA_character_)
}
session_dir <- tempfile("pwllm-adaptive-session-")
state <- adaptive_rank_start(
  items = items,
  seed = 42,
  adaptive_config = list(
    global_identified_reliability_min = 0.85,
    star_override_budget_per_round = 2L
  ),
  session_dir = session_dir,
  persist_item_log = TRUE
)
state <- adaptive_rank_run_live(
  state = state,
  judge = sim_judge,
  n_steps = 6,
  btl_config = list(
    # Keep examples lightweight while showing custom stop config inputs.
    refit_pairs_target = 50L,
    ess_bulk_min = 400,
    eap_reliability_min = 0.90
  ),
  adaptive_config = list(
    explore_taper_mult = 0.40,
    boundary_frac = 0.20
  ),
  progress = "steps",
  progress_redraw_every = 1L,
  progress_show_events = TRUE,
  progress_errors = TRUE
)
#> step 1: new_pairs_since_last_refit=1/50 committed=1 invalid=0 starved=0
#> step 2: new_pairs_since_last_refit=2/50 committed=2 invalid=0 starved=0
#> step 3: new_pairs_since_last_refit=3/50 committed=3 invalid=0 starved=0
#> step 4: new_pairs_since_last_refit=4/50 committed=4 invalid=0 starved=0
#> step 5: new_pairs_since_last_refit=5/50 committed=5 invalid=0 starved=0
#> step 6: new_pairs_since_last_refit=6/50 committed=6 invalid=0 starved=0
# Print and inspect run outputs.
print(state)
#> Adaptive state
#> items: 8
#> steps: 6 (committed=6)
#> refits: 0
#> last stop: continue
run_summary <- summarize_adaptive(state)
step_view <- adaptive_step_log(state)
logs <- adaptive_get_logs(state)
run_summary
#> # A tibble: 1 × 6
#> n_items steps_attempted committed_pairs n_refits last_stop_decision
#> <int> <int> <int> <int> <lgl>
#> 1 8 6 6 0 FALSE
#> # ℹ 1 more variable: last_stop_reason <chr>
head(step_view)
#> # A tibble: 6 × 51
#> step_id timestamp pair_id i j A B Y status
#> <int> <dttm> <int> <int> <int> <int> <int> <int> <chr>
#> 1 1 2026-02-11 04:39:08 1 1 5 1 5 0 ok
#> 2 2 2026-02-11 04:39:08 2 5 8 5 8 0 ok
#> 3 3 2026-02-11 04:39:08 3 8 6 8 6 1 ok
#> 4 4 2026-02-11 04:39:08 4 6 2 6 2 1 ok
#> 5 5 2026-02-11 04:39:08 5 2 4 2 4 0 ok
#> 6 6 2026-02-11 04:39:09 6 4 3 4 3 1 ok
#> # ℹ 42 more variables: round_id <int>, round_stage <chr>, pair_type <chr>,
#> # used_in_round_i <int>, used_in_round_j <int>, is_anchor_i <lgl>,
#> # is_anchor_j <lgl>, stratum_i <int>, stratum_j <int>, dist_stratum <int>,
#> # stage_committed_so_far <int>, stage_quota <int>, is_explore_step <lgl>,
#> # explore_mode <chr>, explore_reason <chr>, explore_rate_used <dbl>,
#> # local_priority_mode <chr>, long_gate_pass <lgl>, long_gate_reason <chr>,
#> # star_override_used <lgl>, star_override_reason <chr>, …
names(logs)
#> [1] "step_log" "round_log" "item_log"
# Resume from disk and continue.
resumed <- adaptive_rank_resume(session_dir)
resumed <- adaptive_rank_run_live(
  state = resumed,
  judge = sim_judge,
  n_steps = 4,
  progress = "none"
)
summarize_adaptive(resumed)
#> # A tibble: 1 × 6
#> n_items steps_attempted committed_pairs n_refits last_stop_decision
#> <int> <int> <int> <int> <lgl>
#> 1 8 10 9 0 FALSE
#> # ℹ 1 more variable: last_stop_reason <chr>
# ------------------------------------------------------------------
# Live OpenAI workflow via backend-agnostic llm_compare_pair()
# ------------------------------------------------------------------
if (FALSE) { # \dontrun{
# Requires network + OPENAI_API_KEY. This incurs API cost.
# check_llm_api_keys() is a quick preflight.
check_llm_api_keys()
data("example_writing_samples", package = "pairwiseLLM")
live_items <- dplyr::rename(
  example_writing_samples[1:12, c("ID", "text")],
  item_id = ID
)
# Default trait/template setup used by the backend-agnostic runner.
trait <- trait_description("overall_quality")
prompt_template <- set_prompt_template()
live_session_dir <- file.path(tempdir(), "pwllm-adaptive-openai")
judge_openai <- function(A, B, state, ...) {
  res <- llm_compare_pair(
    ID1 = A$item_id[[1]],
    text1 = A$text[[1]],
    ID2 = B$item_id[[1]],
    text2 = B$text[[1]],
    model = "gpt-5.1",
    trait_name = trait$name,
    trait_description = trait$description,
    prompt_template = prompt_template,
    backend = "openai",
    endpoint = "responses",
    reasoning = "low",
    service_tier = "flex",
    include_thoughts = FALSE,
    temperature = NULL,
    top_p = NULL,
    logprobs = NULL
  )
  better_id <- res$better_id[[1]]
  ok_ids <- c(A$item_id[[1]], B$item_id[[1]])
  # Treat a missing or unrecognized winner as an invalid response.
  if (is.na(better_id) || !(better_id %in% ok_ids)) {
    return(list(
      is_valid = FALSE,
      Y = NA_integer_,
      invalid_reason = "model_response_invalid"
    ))
  }
  list(
    is_valid = TRUE,
    Y = as.integer(identical(better_id, A$item_id[[1]])),
    invalid_reason = NA_character_
  )
}
state_live <- adaptive_rank_start(
  items = live_items,
  seed = 2026,
  session_dir = live_session_dir,
  persist_item_log = TRUE
)
state_live <- adaptive_rank_run_live(
  state = state_live,
  judge = judge_openai,
  n_steps = 120L,
  btl_config = list(
    refit_pairs_target = 20L,
    ess_bulk_min = 500,
    ess_bulk_min_near_stop = 1200,
    max_rhat = 1.01,
    divergences_max = 0L,
    eap_reliability_min = 0.92,
    stability_lag = 2L,
    theta_corr_min = 0.97,
    theta_sd_rel_change_max = 0.08,
    rank_spearman_min = 0.97
  ),
  progress = "all",
  progress_redraw_every = 1L,
  progress_show_events = TRUE,
  progress_errors = TRUE
)
# Reporting outputs for end users.
print(state_live)
run_summary <- summarize_adaptive(state_live)
refit_summary <- summarize_refits(state_live)
item_summary <- summarize_items(state_live)
logs <- adaptive_get_logs(state_live)
# Store outputs for audit/reproducibility.
saveRDS(
  list(
    run_summary = run_summary,
    refit_summary = refit_summary,
    item_summary = item_summary,
    logs = logs
  ),
  file.path(live_session_dir, "adaptive_outputs.rds")
)
# Resume from stored state and continue sampling.
state_live <- adaptive_rank_resume(live_session_dir)
state_live <- adaptive_rank_run_live(
  state = state_live,
  judge = judge_openai,
  n_steps = 40L,
  progress = "refits"
)
print(summarize_adaptive(state_live))
} # }