Google’s preview of their Gemini 2.5 Pro model has really made a splash. The model has become many folks’ daily driver, and I’ve started to see “What about Gemini?” in the comments of each of these blog posts if they don’t explicitly call out the model series in the title. Yesterday, Google announced an update of the preview for Gemini 2.5 Flash, a smaller and cheaper version of 2.5 Pro.
In the model card, Google juxtaposes Gemini 2.5 Flash with OpenAI’s o4-mini:
This comparison especially caught my eye, given that o4-mini is the current leader in the class of cheap, snappy thinking models on an R coding evaluation I’ve been running for the last few months. The proposition seems to be “o4-mini-ish performance at a fraction of the price.”
In this post, I’ll use the vitals package to compare Gemini 2.5 Flash against several other models:
- Gemini 2.0 Flash, the previous generation of this series
- Gemini 2.5 Pro, the more performant and expensive version of the model
- GPT o4-mini, supposedly a peer in performance and a leader on this eval in the class of cheap and snappy reasoning models
- Claude 3.7 Sonnet, my daily driver for coding assistance
tl;dr
- 2.5 Flash’s performance is really impressive for its price point and is a marked improvement over its previous generation.
- o4-mini does show significantly stronger performance on this eval (though it is notably more expensive).
- Unlike with Claude 3.7 Sonnet, enabling thinking for Gemini 2.5 Flash resulted in a marked increase in performance.
- Gemini 2.5 Flash with thinking disabled is nearly indistinguishable from GPT 4.1-nano on this eval. 2.5 Flash reached an accuracy of 43.6% at a cost of $0.15/m input, $0.60/m output, while 4.1-nano scored 44.2% at a cost of $0.10/m input, $0.40/m output. 4.1-nano continues to pack the greatest punch at the budget, non-thinking price point.
- I’m starting to think about an initial CRAN release of vitals. Please do give it a whir and let me know if you have any feedback.
Setting up the evaluation
Let’s start by defining our model connections using ellmer:

library(ellmer)
library(vitals)
library(tidyverse)

# thinking is enabled by default, and can be disabled
# with a magic incantation:
# https://ai.google.dev/gemini-api/docs/thinking
gemini_2_5_flash_thinking <- chat_google_gemini(
  model = "gemini-2.5-flash-preview-05-20"
)

gemini_2_5_flash_non_thinking <- chat_google_gemini(
  model = "gemini-2.5-flash-preview-05-20",
  api_args = list(
    generationConfig = list(
      thinkingConfig = list(
        thinkingBudget = 0
      )
    )
  )
)

gemini_2_0_flash <- chat_google_gemini(
  model = "gemini-2.0-flash"
)

gemini_2_5_pro <- chat_google_gemini(
  model = "gemini-2.5-pro-preview-05-06"
)

gpt_o4_mini <- chat_openai(model = "o4-mini-2025-04-16")

# note that i don't enable thinking here; thinking
# doesn't seem to have an effect for claude on this
# eval: https://www.simonpcouch.com/blog/2025-04-18-o3-o4-mini/
claude_sonnet_3_7 <- chat_anthropic(model = "claude-3-7-sonnet-latest")
Note that I needed to configure the GOOGLE_API_KEY, ANTHROPIC_API_KEY, and OPENAI_API_KEY environment variables to connect to these models. The pricing for these models varies considerably (prices below are per million tokens):
# A tibble: 6 × 3
Name Input Output
<chr> <chr> <chr>
1 Gemini 2.5 Flash (Thinking) $0.15 $3.50
2 Gemini 2.5 Flash (Non-thinking) $0.15 $0.60
3 Gemini 2.0 Flash $0.10 $0.40
4 Gemini 2.5 Pro $1.25 $10.00
5 GPT o4-mini $1.10 $4.40
6 Claude 3.7 Sonnet $3.00 $15.00
Gemini 2.5 Flash has thinking and non-thinking modes: thinking tokens are not surfaced to the user, but output tokens are charged at a higher rate when thinking is enabled. With thinking enabled (as shown on the model card), Gemini 2.5 Flash’s output tokens are priced somewhat similarly to o4-mini’s.
Gemini 2.5 Pro, Gemini 2.5 Flash (Thinking), and GPT o4-mini are reasoning models, and thus will use more tokens than non-reasoning models. While Claude 3.7 Sonnet has a reasoning mode that could be enabled here, I haven’t enabled it, as it doesn’t seem to make a difference in performance on this eval.
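To make the relative output pricing concrete, here’s a quick sketch that rebuilds the table above as a tibble and computes each model’s output price relative to o4-mini’s. The values are copied from the table; the tribble() construction and the ratio column are just for illustration.

pricing <- tribble(
  ~name,                             ~input, ~output,
  "Gemini 2.5 Flash (Thinking)",      0.15,   3.50,
  "Gemini 2.5 Flash (Non-thinking)",  0.15,   0.60,
  "Gemini 2.0 Flash",                 0.10,   0.40,
  "Gemini 2.5 Pro",                   1.25,  10.00,
  "GPT o4-mini",                      1.10,   4.40,
  "Claude 3.7 Sonnet",                3.00,  15.00
)

# output price (per million tokens) relative to o4-mini
pricing %>%
  mutate(output_vs_o4_mini = output / output[name == "GPT o4-mini"])

With thinking enabled, 2.5 Flash’s output tokens come in at roughly 80% of o4-mini’s price; with thinking disabled, roughly 14%.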
Let’s set up a task that will evaluate each model using the are dataset from vitals:
are_task <- Task$new(
  dataset = are,
  solver = generate(),
  scorer = model_graded_qa(
    scorer_chat = claude_sonnet_3_7,
    partial_credit = TRUE
  ),
  epochs = 3,
  name = "An R Eval"
)

are_task
are_task
An evaluation task An-R-Eval.
See my first post on Gemini 2.5 Pro for a more thorough description of this evaluation.
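If you’d like to poke at the underlying data yourself, are is a regular data frame, so the usual tidyverse tools apply. For example:

# quick look at the structure of the evaluation dataset
# (glimpse() comes from dplyr, loaded above via the tidyverse)
glimpse(are)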
Running the evaluations
First, we’ll evaluate our reference model, Gemini 2.5 Flash with thinking enabled:
are_gemini_2_5_flash_thinking <- are_task$clone()
are_gemini_2_5_flash_thinking$eval(
  solver_chat = gemini_2_5_flash_thinking
)
From here, it’s pretty rote. The same model without thinking enabled:
are_gemini_2_5_flash_non_thinking <- are_task$clone()
are_gemini_2_5_flash_non_thinking$eval(
  solver_chat = gemini_2_5_flash_non_thinking
)
Now for the other Gemini models:
are_gemini_2_0_flash <- are_task$clone()
are_gemini_2_0_flash$eval(solver_chat = gemini_2_0_flash)
are_gemini_2_5_pro <- are_task$clone()
are_gemini_2_5_pro$eval(solver_chat = gemini_2_5_pro)
Next, we’ll evaluate GPT o4-mini:
are_gpt_o4_mini <- are_task$clone()
are_gpt_o4_mini$eval(solver_chat = gpt_o4_mini)
Finally, let’s evaluate Claude 3.7 Sonnet:
are_claude_3_7 <- are_task$clone()
are_claude_3_7$eval(solver_chat = claude_sonnet_3_7)
The interactive viewer will allow us to inspect the evaluation in detail:
While the total durations of the evaluations are correct in the viewer, the timings of specific samples are now off. Given some changes in downstream packages, vitals has to estimate how long a given request takes rather than receiving the exact duration; this will be resolved down the line.
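If you’re running these evals yourself, the viewer can be opened from the console once the evaluations above have logged their results; with the default log directory, that’s something like:

# open the interactive viewer for the logged evaluations
vitals_view()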
Analysis
Let’s combine the results of all evaluations to compare the models:
are_eval <-
  vitals_bind(
    `Gemini 2.5 Flash (Thinking)` = are_gemini_2_5_flash_thinking,
    `Gemini 2.5 Flash (Non-thinking)` = are_gemini_2_5_flash_non_thinking,
    `Gemini 2.0 Flash` = are_gemini_2_0_flash,
    `Gemini 2.5 Pro` = are_gemini_2_5_pro,
    `GPT o4-mini` = are_gpt_o4_mini,
    `Claude Sonnet 3.7` = are_claude_3_7
  ) %>%
  rename(model = task) %>%
  mutate(
    model = factor(model, levels = c(
      "Gemini 2.5 Flash (Thinking)",
      "Gemini 2.5 Flash (Non-thinking)",
      "Gemini 2.0 Flash",
      "Gemini 2.5 Pro",
      "GPT o4-mini",
      "Claude Sonnet 3.7"
    ))
  )
are_eval
# A tibble: 468 × 5
model id epoch score metadata
<fct> <chr> <int> <ord> <list>
1 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 1 I <tibble>
2 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 2 I <tibble>
3 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 3 I <tibble>
4 Gemini 2.5 Flash (Thinking) conditional-group… 1 P <tibble>
5 Gemini 2.5 Flash (Thinking) conditional-group… 2 C <tibble>
6 Gemini 2.5 Flash (Thinking) conditional-group… 3 C <tibble>
7 Gemini 2.5 Flash (Thinking) correlated-delays… 1 P <tibble>
8 Gemini 2.5 Flash (Thinking) correlated-delays… 2 P <tibble>
9 Gemini 2.5 Flash (Thinking) correlated-delays… 3 P <tibble>
10 Gemini 2.5 Flash (Thinking) curl-http-get 1 C <tibble>
# ℹ 458 more rows
Let’s visualize the results with a bar chart:
are_eval %>%
  mutate(
    score = fct_recode(
      score,
      "Correct" = "C", "Partially Correct" = "P", "Incorrect" = "I"
    )
  ) %>%
  ggplot(aes(y = model, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(
    breaks = rev,
    values = c(
      "Correct" = "#67a9cf",
      "Partially Correct" = "#f6e8c3",
      "Incorrect" = "#ef8a62"
    )
  ) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Percent", y = "Model",
    title = "An R Eval",
    subtitle = "The Gemini 2.5 Flash models represent a middle-ground between 2.5 Pro and\no4-mini, both in terms of price and performance."
  ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    legend.position = "bottom"
  )
To determine if the differences we’re seeing are statistically significant, we’ll use a cumulative link mixed model:
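The code that fits this model isn’t shown in the post body; reconstructing it from the summary output below, it was presumably fitted with clmm() from the ordinal package, along these lines:

library(ordinal)

# cumulative link mixed model: graded score as a function of
# model, with a random intercept for each question id
are_mod <- clmm(score ~ model + (1 | id), data = are_eval)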
summary(are_mod)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: score ~ model + (1 | id)
data: are_eval
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 468 -375.15 766.30 397(1971) 8.31e-05 6.4e+01
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 5.523 2.35
Number of groups: id 26
Coefficients:
Estimate Std. Error z value
modelGemini 2.5 Flash (Non-thinking) -0.8506 0.3604 -2.360
modelGemini 2.0 Flash -1.3338 0.3698 -3.607
modelGemini 2.5 Pro 0.6600 0.3611 1.828
modelGPT o4-mini 0.7887 0.3620 2.179
modelClaude Sonnet 3.7 0.3740 0.3567 1.048
Pr(>|z|)
modelGemini 2.5 Flash (Non-thinking) 0.01826 *
modelGemini 2.0 Flash 0.00031 ***
modelGemini 2.5 Pro 0.06759 .
modelGPT o4-mini 0.02934 *
modelClaude Sonnet 3.7 0.29441
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
I|P -1.3191 0.5404 -2.441
P|C 0.4820 0.5361 0.899
For the purposes of this post, we’ll just take a look at the Coefficients table. The reference model here is Gemini 2.5 Flash (Thinking). Negative coefficient estimates for a given model indicate that model is less likely to receive higher ratings than Gemini 2.5 Flash (Thinking); positive estimates indicate the opposite. Looking at the coefficients:
- Gemini 2.5 Flash is a marked improvement over its previous generation on this eval.
- Gemini 2.5 Flash significantly lags behind o4-mini on this eval.
- Enabling Gemini 2.5 Flash’s thinking results in a marked increase in performance over the non-thinking model, though it brings the pricing much closer to the more performant o4-mini.
- Gemini 2.5 Flash with thinking disabled is nearly indistinguishable from GPT 4.1-nano on this eval. 2.5 Flash reached an accuracy of 43.6% at a cost of $0.15/m input, $0.60/m output, while 4.1-nano scored 44.2% at a cost of $0.10/m input, $0.40/m output. 4.1-nano continues to pack the greatest punch at the budget, non-thinking price point.
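As an aside, the accuracy figures in that last bullet collapse the graded scores to a single number. Assuming partially correct answers count for half credit (my assumption about how these figures are computed), a quick way to tabulate accuracy by model from are_eval looks like:

are_eval %>%
  mutate(
    # C = full credit, P = half credit, I = no credit
    score_num = case_when(
      score == "C" ~ 1,
      score == "P" ~ 0.5,
      score == "I" ~ 0
    )
  ) %>%
  summarize(accuracy = mean(score_num), .by = model)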
One more note before I wrap up: For the past month or two, development on ellmer and vitals has been quite coupled to support answering common questions about LLM performance. With the release of ellmer 0.2.0 on CRAN last week, I’m starting to gear up for an initial CRAN release of vitals here soon. In the meantime, I’m especially interested in feedback from folks who have given the package a go! Do let me know if you give it a whir and run into any hiccups.
Thank you to Max Kuhn for advising on the model-based analysis here.
In a first for this blog, I tried using a model to help me write this post. In general, I don’t tend to use models to help with writing at all. Now that I’ve written a good few of these posts to pattern-match from, I wondered if Claude 3.7 Sonnet could draft a reasonable starting place. I used this prompt; as usual, I ended up deleting all of the prose that the model wrote, but it was certainly a boost to have all of the code written for me.