Claude 4 dropped on Thursday! Given that Claude 3.7 Sonnet is my daily driver LLM for R coding, I’ve been excited to poke at it.
The last few months, I’ve been writing a series of blog posts where I evaluate new LLM releases on their R coding performance. I do so entirely in R using the ellmer and vitals packages, the latter of which will be headed to CRAN in the coming weeks. In this post, I’ll skip over all of the evaluation code and just make some graphs; if you’re interested in learning more about how to run an eval like this one, check out the post Evaluating o3 and o4-mini on R coding performance.
Here’s the gist:
- An R Eval is a dataset of challenging R coding problems.
- We’ll run an evaluation on that dataset with Claude 4 Sonnet, Claude 4 Opus (which Anthropic made a point of noting has impressive coding performance), Claude 3.7 Sonnet (my previous coding daily driver), and o4-mini (up to this point, the most performant model on this eval).
- Using the results, we can measure how well different models solved the R coding problems, how many tokens (and thus dollars) they used to get to those answers, and how long it took them to finish their answer.
Full evaluation code:
library(ellmer)
library(vitals)
library(tidyverse)
# Connections to each model; Claude 3.7 Sonnet will also serve as the grader
claude_4_sonnet <- chat_anthropic(model = "claude-sonnet-4-20250514")
claude_4_opus <- chat_anthropic(model = "claude-opus-4-20250514")
claude_3_7_sonnet <- chat_anthropic(model = "claude-3-7-sonnet-latest")
gpt_o4_mini <- chat_openai(model = "o4-mini-2025-04-16")
# Define the eval: each model answers every problem in the `are` dataset
# three times, and Claude 3.7 Sonnet grades the answers with partial credit
are_task <- Task$new(
  dataset = are,
  solver = generate(),
  scorer = model_graded_qa(
    scorer_chat = claude_3_7_sonnet,
    partial_credit = TRUE
  ),
  epochs = 3,
  name = "An R Eval"
)
are_task
# Evaluate each model on its own clone of the task, then save the results
are_claude_4_sonnet <- are_task$clone()
are_claude_4_sonnet$eval(solver_chat = claude_4_sonnet)
are_claude_4_sonnet <- vitals:::scrub_providers(are_claude_4_sonnet)
save(are_claude_4_sonnet, file = "blog/2025-05-27-claude-4/tasks/are_claude_4_sonnet.rda")

are_claude_4_opus <- are_task$clone()
are_claude_4_opus$eval(solver_chat = claude_4_opus)
are_claude_4_opus <- vitals:::scrub_providers(are_claude_4_opus)
save(are_claude_4_opus, file = "blog/2025-05-27-claude-4/tasks/are_claude_4_opus.rda")

are_claude_3_7_sonnet <- are_task$clone()
are_claude_3_7_sonnet$eval(solver_chat = claude_3_7_sonnet)
are_claude_3_7_sonnet <- vitals:::scrub_providers(are_claude_3_7_sonnet)
save(are_claude_3_7_sonnet, file = "blog/2025-05-27-claude-4/tasks/are_claude_3_7_sonnet.rda")

are_gpt_o4_mini <- are_task$clone()
are_gpt_o4_mini$eval(solver_chat = gpt_o4_mini)
are_gpt_o4_mini <- vitals:::scrub_providers(are_gpt_o4_mini)
save(are_gpt_o4_mini, file = "blog/2025-05-27-claude-4/tasks/are_gpt_o4_mini.rda")
You can view the raw results of the evaluation in this interactive viewer:
While the total durations of the evaluations are correct in the viewer, the timings of specific samples are now estimated. Given some changes in downstream packages, vitals has to estimate how long a given request takes rather than receiving the exact duration; this will be resolved down the line.
Analysis
At this point, we have access to are_eval, a data frame containing all of the results collected during the evaluation.
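The code that builds are_eval is part of the evaluation code I skipped over; a minimal sketch of how it could be assembled, assuming vitals_bind() stacks each task’s samples under a task column named after the argument (the level names here match the coefficient names in the model output later in the post):

are_eval <-
  vitals_bind(
    `Claude 4 Sonnet` = are_claude_4_sonnet,
    `Claude 4 Opus` = are_claude_4_opus,
    `Claude 3.7 Sonnet` = are_claude_3_7_sonnet,
    `GPT o4-mini` = are_gpt_o4_mini
  ) %>%
  rename(model = task) %>%
  # order the models as they'll appear in the plots
  mutate(model = factor(model, levels = c(
    "Claude 4 Sonnet", "Claude 4 Opus", "Claude 3.7 Sonnet", "GPT o4-mini"
  )))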
are_eval
# A tibble: 312 × 5
model id epoch score metadata
<fct> <chr> <int> <ord> <list>
1 Claude 4 Sonnet after-stat-bar-heights 1 I <tibble>
2 Claude 4 Sonnet after-stat-bar-heights 2 I <tibble>
3 Claude 4 Sonnet after-stat-bar-heights 3 I <tibble>
4 Claude 4 Sonnet conditional-grouped-summary 1 C <tibble>
5 Claude 4 Sonnet conditional-grouped-summary 2 P <tibble>
6 Claude 4 Sonnet conditional-grouped-summary 3 C <tibble>
7 Claude 4 Sonnet correlated-delays-reasoning 1 P <tibble>
8 Claude 4 Sonnet correlated-delays-reasoning 2 P <tibble>
9 Claude 4 Sonnet correlated-delays-reasoning 3 I <tibble>
10 Claude 4 Sonnet curl-http-get 1 C <tibble>
# ℹ 302 more rows
The evaluation scores each answer as “Correct”, “Partially Correct”, or “Incorrect”. We can use a bar chart to visualize the proportions of responses that fell into each of those categories:
are_eval %>%
  mutate(
    score = fct_recode(
      score,
      "Correct" = "C", "Partially Correct" = "P", "Incorrect" = "I"
    )
  ) %>%
  ggplot(aes(y = model, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(
    breaks = rev,
    values = c(
      "Correct" = "#67a9cf",
      "Partially Correct" = "#f6e8c3",
      "Incorrect" = "#ef8a62"
    )
  ) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Percent", y = "Model",
    title = "An R Eval",
    subtitle = "Claude 4 models represent a step forward in R coding performance."
  ) +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    panel.background = element_rect(fill = "#f8f8f1", color = NA),
    plot.background = element_rect(fill = "#f8f8f1", color = NA),
    legend.background = element_rect(fill = "#F3F3EE", color = NA),
    plot.subtitle = element_text(face = "italic"),
    legend.position = "bottom"
  )
The pricing per token of each of these models differs quite a bit, though, and o4-mini might use many more tokens during its “reasoning.” How much did it cost to run each of these evals, and what’s the resulting cost-per-performance?
The pricing per million tokens for these models is as follows:
# A tibble: 4 × 3
Name Input Output
<chr> <chr> <chr>
1 Claude 4 Sonnet $3.00 $15.00
2 Claude 4 Opus $15.00 $75.00
3 Claude 3.7 Sonnet $3.00 $15.00
4 o4-mini $1.10 $4.40
Under the hood, I’ve calculated the total cost of running the eval for each model using the shown pricing and joined it to the evaluation results:
costs
Name Price Score
1 Claude 4 Sonnet 0.7204620 59.61538
2 Claude 4 Opus 3.3161850 66.02564
3 Claude 3.7 Sonnet 0.4359000 57.05128
4 o4-mini 0.6164763 65.38462
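The cost arithmetic itself is just token counts times the prices above. A minimal sketch, where the token totals (which aren’t shown in this post) would come from each model’s eval:

# Hypothetical helper: turn input/output token totals and per-million-token
# prices into a dollar cost
eval_cost <- function(input_tokens, output_tokens, input_price, output_price) {
  (input_tokens * input_price + output_tokens * output_price) / 1e6
}

# e.g., eval_cost(sonnet_input_tokens, sonnet_output_tokens, 3.00, 15.00),
# where the token totals are placeholders for the counts logged during the eval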
library(ggrepel)

ggplot(costs) +
  aes(x = Price, y = Score, label = Name) +
  geom_point(size = 3, color = "#4A7862") +
  geom_label_repel() +
  scale_y_continuous(limits = c(55, 70), labels = function(x) paste0(x, "%")) +
  labs(
    x = "Total Price (USD)",
    y = "Score",
    title = "R Coding Performance vs. Cost",
    subtitle = "o4-mini packs the most punch at its price point, though\nClaude 4 Opus is state-of-the-art."
  ) +
  lims(x = c(0, 4))
While Claude 4 Opus is the new SotA on this eval, o4-mini almost matches its performance at a small fraction of the price.
To determine if the differences we’re seeing are statistically significant, we’ll use a cumulative link mixed model:
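The model-fitting code isn’t shown here, but a fit along these lines with the ordinal package, using the formula and data reported in the summary below, would produce it:

library(ordinal)

# Ordinal scores (Incorrect < Partially Correct < Correct) modeled with a
# fixed effect for model and a random intercept for each problem id
are_mod <- clmm(score ~ model + (1 | id), data = are_eval)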
summary(are_mod)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: score ~ model + (1 | id)
data: are_eval
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 312 -205.43 422.85 234(1729) 7.06e-06 2.0e+01
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 13.56 3.682
Number of groups: id 26
Coefficients:
Estimate Std. Error z value Pr(>|z|)
modelClaude 4 Opus 0.6190 0.4126 1.500 0.134
modelClaude 3.7 Sonnet -0.3293 0.4115 -0.800 0.424
modelGPT o4-mini 0.5753 0.4236 1.358 0.174
Threshold coefficients:
Estimate Std. Error z value
I|P -2.25743 0.43244 -5.220
P|C -0.04208 0.48543 -0.087
For the purposes of this post, we’ll just take a look at the Coefficients table. The reference model here is Claude 4 Sonnet; negative coefficient estimates for a given model indicate that it is less likely than Claude 4 Sonnet to receive higher ratings. Looking at the coefficients, Claude 4 Sonnet does seem like an improvement on its previous generation, but we don’t see evidence of a statistically significant difference in performance between Claude 4 Sonnet and any other model on this eval.
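Since these estimates are on the log-odds scale, one way to make them a bit more tangible (assuming the usual coef() method for clmm fits) is to exponentiate them into odds ratios:

# Odds ratios relative to Claude 4 Sonnet (the threshold terms appear here too)
exp(coef(are_mod))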
Thank you to Max Kuhn for advising on the model-based analysis here.