This function reads an OpenAI Batch API output file (JSONL) and extracts pairwise comparison results for use with Bradley–Terry models. It supports both the Chat Completions endpoint (where object = "chat.completion") and the Responses endpoint (where object = "response"), including GPT-5.1 with reasoning.

Usage

parse_openai_batch_output(
  path,
  tag_prefix = "<BETTER_SAMPLE>",
  tag_suffix = "</BETTER_SAMPLE>"
)

Arguments

path

Path to a JSONL output file downloaded from the OpenAI Batch API.

tag_prefix

Character string marking the start of the better-sample tag. Defaults to "<BETTER_SAMPLE>".

tag_suffix

Character string marking the end of the better-sample tag. Defaults to "</BETTER_SAMPLE>".
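
Both markers can be overridden when your prompts use a different tag. A minimal sketch (the file path here is hypothetical):

parse_openai_batch_output(
  "batch_output.jsonl",
  tag_prefix = "<WINNER>",
  tag_suffix = "</WINNER>"
)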

Value

A tibble with one row per successfully parsed comparison and the following columns:

custom_id

The custom_id from the batch request.

ID1, ID2

Sample IDs inferred from custom_id.

model

The model name reported by the API.

object_type

The OpenAI response object type (e.g., "chat.completion" or "response").

status_code

HTTP-style status code from the batch output.

error_message

Error message, if present; otherwise NA.

thoughts

Reasoning / thinking summary text when available (for Responses with reasoning); otherwise NA.

content

The raw visible assistant content string (the LLM's output), used to locate the <BETTER_SAMPLE> tag. For Responses with reasoning, this does not include reasoning summaries, which are kept in thoughts.

better_sample

Either "SAMPLE_1", "SAMPLE_2", or NA if the tag was not found.

better_id

ID1 if SAMPLE_1 was chosen, ID2 if SAMPLE_2 was chosen, or NA.

prompt_tokens

Prompt/input token count (if reported).

completion_tokens

Completion/output token count (if reported).

total_tokens

Total tokens (if reported).

prompt_cached_tokens

Cached prompt tokens (if reported via input_tokens_details$cached_tokens); otherwise NA.

reasoning_tokens

Reasoning tokens (if reported via output_tokens_details$reasoning_tokens); otherwise NA.
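
A quick audit of a parsed batch, assuming res is the tibble returned by parse_openai_batch_output (base R shown for illustration):

# Comparisons where no better-sample tag was found
res[is.na(res$better_sample), c("custom_id", "status_code", "error_message")]

# Total token usage across all comparisons
sum(res$total_tokens, na.rm = TRUE)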

Details

For each line, the function:

  • extracts custom_id and parses ID1 and ID2 from the pattern "<prefix>ID1_vs_ID2" (a sketch follows this list),

  • pulls the raw LLM content containing the <BETTER_SAMPLE>...</BETTER_SAMPLE> tag,

  • determines whether SAMPLE_1 or SAMPLE_2 was selected and maps that to better_id,

  • collects model name and token usage statistics (including reasoning tokens for GPT-5.1 Responses),

  • when using the Responses endpoint with reasoning, separates reasoning summaries into the thoughts column and visible assistant output into content.
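
A minimal sketch of the ID parsing and choice mapping described above, assuming IDs contain no underscores (illustrative only; the package's internal code may differ):

custom_id <- "LIVE_A_vs_B"
m <- regmatches(custom_id, regexec("([^_]+)_vs_([^_]+)$", custom_id))[[1]]
ID1 <- m[2]  # "A"
ID2 <- m[3]  # "B"
better_sample <- "SAMPLE_1"  # as read from the better-sample tag
better_id <- switch(better_sample, SAMPLE_1 = ID1, SAMPLE_2 = ID2, NA_character_)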

The returned data frame is suitable as input for build_bt_data.
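
For example (a hypothetical call, assuming res from parse_openai_batch_output; see build_bt_data for its actual interface):

bt_data <- build_bt_data(res)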

Examples

# Create a temporary JSONL file containing a simulated OpenAI batch result
tf <- tempfile(fileext = ".jsonl")

# A single line of JSON representing a successful Chat Completion
# custom_id implies "LIVE_" prefix, ID1="A", ID2="B"
json_line <- paste0(
  '{"custom_id": "LIVE_A_vs_B", ',
  '"response": {"status_code": 200, "body": {',
  '"object": "chat.completion", ',
  '"model": "gpt-4", ',
  '"choices": [{"message": {"content": "<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE>"}}], ',
  '"usage": {"prompt_tokens": 50, "completion_tokens": 10, "total_tokens": 60}}}}'
)

writeLines(json_line, tf)

# Parse the output
res <- parse_openai_batch_output(tf)

# Inspect the result
print(res$better_id)
#> [1] "A"
print(res$prompt_tokens)
#> [1] 50
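
# A second, hedged sketch: a Responses-endpoint line with a reasoning summary.
# The body shape below follows the public OpenAI Responses API; the exact
# fields the parser expects are an assumption here.
json_line_resp <- paste0(
  '{"custom_id": "LIVE_A_vs_B", ',
  '"response": {"status_code": 200, "body": {',
  '"object": "response", ',
  '"model": "gpt-5.1", ',
  '"output": [',
  '{"type": "reasoning", "summary": [{"type": "summary_text", ',
  '"text": "Comparing both samples for quality..."}]}, ',
  '{"type": "message", "content": [{"type": "output_text", ',
  '"text": "<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>"}]}], ',
  '"usage": {"input_tokens": 50, "output_tokens": 40, "total_tokens": 90, ',
  '"output_tokens_details": {"reasoning_tokens": 30}}}}}'
)

writeLines(json_line_resp, tf)
res2 <- parse_openai_batch_output(tf)
# If the parser accepts this shape, res2$thoughts carries the reasoning
# summary and res2$better_id is "B"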

# Clean up
unlink(tf)