This function reads an OpenAI Batch API output file (JSONL) and extracts
pairwise comparison results for use with Bradley–Terry models. It supports
both the Chat Completions endpoint (where object = "chat.completion")
and the Responses endpoint (where object = "response"), including
GPT-5.1 with reasoning.
Usage

parse_openai_batch_output(
  path,
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>"
)

Value
A tibble with one row per successfully parsed comparison and columns:

- custom_id
  The custom_id from the batch request.
- ID1, ID2
  Sample IDs inferred from custom_id.
- model
  The model name reported by the API.
- object_type
  The OpenAI response object type (e.g., "chat.completion" or "response").
- status_code
  HTTP-style status code from the batch output.
- error_message
  Error message, if present; otherwise NA.
- thoughts
  Reasoning / thinking summary text when available (for Responses with reasoning); otherwise NA.
- content
  The raw assistant visible content string (the LLM's output), used to locate the <BETTER_SAMPLE> tag. For Responses with reasoning this does not include reasoning summaries, which are kept in thoughts.
- better_sample
  Either "SAMPLE_1", "SAMPLE_2", or NA if the tag was not found.
- better_id
  ID1 if SAMPLE_1 was chosen, ID2 if SAMPLE_2 was chosen, or NA.
- prompt_tokens
  Prompt/input token count (if reported).
- completion_tokens
  Completion/output token count (if reported).
- total_tokens
  Total tokens (if reported).
- prompt_cached_tokens
  Cached prompt tokens (if reported via input_tokens_details$cached_tokens); otherwise NA.
- reasoning_tokens
  Reasoning tokens (if reported via output_tokens_details$reasoning_tokens); otherwise NA.
Details

For each line, the function:

- extracts custom_id and parses ID1 and ID2 from the pattern "<prefix>ID1_vs_ID2",
- pulls the raw LLM content containing the <BETTER_SAMPLE>...</BETTER_SAMPLE> tag,
- determines whether SAMPLE_1 or SAMPLE_2 was selected and maps that to better_id,
- collects the model name and token usage statistics (including reasoning tokens for GPT-5.1 Responses),
- when using the Responses endpoint with reasoning, separates reasoning summaries into the thoughts column and visible assistant output into content.

The returned data frame is suitable as input for build_bt_data.
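The ID parsing and tag-extraction steps above can be sketched with base-R regular expressions. The helper names below (parse_ids, extract_better_sample) are hypothetical and illustrative only; the package's internal implementation may differ:

```r
# Illustrative helpers mirroring two of the parsing steps; these are
# hypothetical names, not functions exported by the package.

# Parse ID1 and ID2 from a custom_id of the form "<prefix>ID1_vs_ID2".
parse_ids <- function(custom_id) {
  m <- regmatches(custom_id, regexec("([^_]+)_vs_([^_]+)$", custom_id))[[1]]
  if (length(m) < 3) c(NA_character_, NA_character_) else m[2:3]
}

# Locate the <BETTER_SAMPLE> tag in the raw assistant content.
extract_better_sample <- function(content,
                                  tag_prefix = "<BETTER_SAMPLE>",
                                  tag_suffix = "</BETTER_SAMPLE>") {
  pattern <- paste0(tag_prefix, "\\s*(SAMPLE_[12])\\s*", tag_suffix)
  m <- regmatches(content, regexec(pattern, content))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

parse_ids("LIVE_A_vs_B")
#> [1] "A" "B"
extract_better_sample("I prefer <BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>.")
#> [1] "SAMPLE_1"
```

Both helpers return NA when the expected pattern is absent, matching the NA semantics of better_sample and better_id described above.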
Examples
# Create a temporary JSONL file containing a simulated OpenAI batch result
tf <- tempfile(fileext = ".jsonl")
# A single line of JSON representing a successful Chat Completion
# custom_id implies "LIVE_" prefix, ID1="A", ID2="B"
json_line <- paste0(
'{"custom_id": "LIVE_A_vs_B", ',
'"response": {"status_code": 200, "body": {',
'"object": "chat.completion", ',
'"model": "gpt-4", ',
'"choices": [{"message": {"content": "<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>"}}], ',
'"usage": {"prompt_tokens": 50, "completion_tokens": 10, "total_tokens": 60}}}}'
)
writeLines(json_line, tf)
# Parse the output
res <- parse_openai_batch_output(tf)
# Inspect the result
print(res$better_id)
#> [1] "A"
print(res$prompt_tokens)
#> [1] 50
# Clean up
unlink(tf)
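A simulated line for the Responses endpoint (e.g., GPT-5.1 with reasoning) looks similar. The body below follows the standard Responses API shape (an output array holding a reasoning item and a message item, with usage reported as input_tokens/output_tokens); which of these fields the parser actually consumes is an assumption here, so no output values are asserted:

```r
# Simulated Responses-endpoint batch line with a reasoning summary.
# The JSON shape follows the OpenAI Responses API; exact field handling
# inside parse_openai_batch_output is assumed, not verified here.
tf2 <- tempfile(fileext = ".jsonl")
json_line2 <- paste0(
  '{"custom_id": "LIVE_C_vs_D", ',
  '"response": {"status_code": 200, "body": {',
  '"object": "response", ',
  '"model": "gpt-5.1", ',
  '"output": [',
  '{"type": "reasoning", "summary": [{"type": "summary_text", "text": "Compared both."}]}, ',
  '{"type": "message", "content": [{"type": "output_text", ',
  '"text": "<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>"}]}], ',
  '"usage": {"input_tokens": 80, "output_tokens": 30, "total_tokens": 110, ',
  '"output_tokens_details": {"reasoning_tokens": 20}}}}}'
)
writeLines(json_line2, tf2)
res2 <- parse_openai_batch_output(tf2)
# Expect the reasoning summary in res2$thoughts and the visible tag
# text in res2$content, per the column descriptions above.
unlink(tf2)
```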