Google’s preview of their Gemini 2.5 Pro model has really made a splash. The model has become many folks’ daily driver, and I’ve started to see “What about Gemini?” in the comments of each of these blog posts if they don’t explicitly call out the model series in the title. Yesterday, Google announced an update of the preview for Gemini 2.5 Flash, a smaller and cheaper version of 2.5 Pro.
In the model card, Google juxtaposes Gemini 2.5 Flash with OpenAI’s o4-mini:
This comparison especially caught my eye, given that o4-mini is the current leader in the class of cheap, snappy thinking models on an R coding evaluation I’ve been running for the last few months. The proposition seems to be “o4-mini-ish performance at a fraction of the price.”
In this post, I’ll use the vitals package to compare Gemini 2.5 Flash against several other models:
- Gemini 2.0 Flash, the previous generation of this series
- Gemini 2.5 Pro, the more performant and expensive version of the model
- GPT o4-mini, supposedly a peer in performance and a leader on this eval in the class of cheap and snappy reasoning models
- Claude 3.7 Sonnet, my daily driver for coding assistance
tl;dr
- 2.5 Flash’s performance is really impressive for its price point and is a marked improvement over its previous generation.
- o4-mini does show significantly stronger performance on this eval (though it is notably more expensive).
- Unlike with Claude 3.7 Sonnet, enabling thinking for Gemini 2.5 Flash resulted in a marked increase in performance.
- Gemini 2.5 Flash with thinking disabled is nearly indistinguishable from GPT 4.1-nano on this eval. 2.5 Flash reached an accuracy of 43.6% at a cost of $0.15/m input, $0.60/m output, while 4.1-nano scored 44.2% at a cost of $0.10/m input, $0.40/m output. 4.1-nano continues to pack the greatest punch at the budget, non-thinking price point.
- I’m starting to think about an initial CRAN release of vitals. Please do give it a whir and let me know if you have any feedback.
Setting up the evaluation
Let’s start by defining our model connections using ellmer:

library(ellmer)
library(vitals)
library(tidyverse)

# thinking is enabled by default, and can be disabled
# with a magic incantation:
# https://ai.google.dev/gemini-api/docs/thinking
gemini_2_5_flash_thinking <- chat_google_gemini(
  model = "gemini-2.5-flash-preview-05-20"
)

gemini_2_5_flash_non_thinking <- chat_google_gemini(
  model = "gemini-2.5-flash-preview-05-20",
  api_args = list(
    generationConfig = list(
      thinkingConfig = list(
        thinkingBudget = 0
      )
    )
  )
)

gemini_2_0_flash <- chat_google_gemini(
  model = "gemini-2.0-flash"
)

gemini_2_5_pro <- chat_google_gemini(
  model = "gemini-2.5-pro-preview-05-06"
)

gpt_o4_mini <- chat_openai(model = "o4-mini-2025-04-16")

# note that i don't enable thinking here; thinking
# doesn't seem to have an effect for claude on this
# eval: https://www.simonpcouch.com/blog/2025-04-18-o3-o4-mini/
claude_sonnet_3_7 <- chat_anthropic(model = "claude-3-7-sonnet-latest")
Note that I needed to configure the GOOGLE_API_KEY, ANTHROPIC_API_KEY, and OPENAI_API_KEY environment variables to connect to these models. The pricing for these models varies considerably (prices below are per million tokens):
# A tibble: 6 × 3
Name Input Output
<chr> <chr> <chr>
1 Gemini 2.5 Flash (Thinking) $0.15 $3.50
2 Gemini 2.5 Flash (Non-thinking) $0.15 $0.60
3 Gemini 2.0 Flash $0.10 $0.40
4 Gemini 2.5 Pro $1.25 $10.00
5 GPT o4-mini $1.10 $4.40
6 Claude 3.7 Sonnet $3.00 $15.00
Gemini 2.5 Flash has thinking and non-thinking modes: thinking tokens are not surfaced to the user, but output tokens are charged at a higher rate when thinking is enabled. With thinking enabled (as shown on the model card), Gemini 2.5 Flash’s output tokens are priced somewhat similarly to o4-mini’s.
Gemini 2.5 Pro, Gemini 2.5 Flash (Thinking), and GPT o4-mini are reasoning models, and thus will use more tokens than non-reasoning models. While Claude 3.7 Sonnet has a reasoning mode that could be enabled here, I haven’t enabled it, as it doesn’t seem to make a difference in performance on this eval.
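To make the relative output pricing concrete, here’s a quick sketch that rebuilds the table above as a tibble and computes each model’s output price relative to o4-mini’s. The values are copied from the table; the tribble() construction and the ratio column are just for illustration.

pricing <- tribble(
  ~name,                             ~input, ~output,
  "Gemini 2.5 Flash (Thinking)",      0.15,   3.50,
  "Gemini 2.5 Flash (Non-thinking)",  0.15,   0.60,
  "Gemini 2.0 Flash",                 0.10,   0.40,
  "Gemini 2.5 Pro",                   1.25,  10.00,
  "GPT o4-mini",                      1.10,   4.40,
  "Claude 3.7 Sonnet",                3.00,  15.00
)

# output price (per million tokens) relative to o4-mini
pricing %>%
  mutate(output_vs_o4_mini = output / output[name == "GPT o4-mini"])

With thinking enabled, 2.5 Flash’s output tokens come in at roughly 80% of o4-mini’s price; with thinking disabled, roughly 14%.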
Let’s set up a task that will evaluate each model using the are dataset from vitals:
are_task <- Task$new(
  dataset = are,
  solver = generate(),
  scorer = model_graded_qa(
    scorer_chat = claude_sonnet_3_7,
    partial_credit = TRUE
  ),
  epochs = 3,
  name = "An R Eval"
)

are_task
are_task
An evaluation task An-R-Eval.
See my first post on Gemini 2.5 Pro for a more thorough description of this evaluation.
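If you’d like to poke at the underlying data yourself, are is a regular data frame, so the usual tidyverse tools apply. For example:

# quick look at the structure of the evaluation dataset
# (glimpse() comes from dplyr, loaded above via the tidyverse)
glimpse(are)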
Running the evaluations
First, we’ll evaluate our reference model, Gemini 2.5 Flash with thinking enabled:
are_gemini_2_5_flash_thinking <- are_task$clone()
are_gemini_2_5_flash_thinking$eval(
  solver_chat = gemini_2_5_flash_thinking
)
From here, it’s pretty rote. The same model without thinking enabled:
are_gemini_2_5_flash_non_thinking <- are_task$clone()
are_gemini_2_5_flash_non_thinking$eval(
  solver_chat = gemini_2_5_flash_non_thinking
)
Now for the other Gemini models:
are_gemini_2_0_flash <- are_task$clone()
are_gemini_2_0_flash$eval(solver_chat = gemini_2_0_flash)
are_gemini_2_5_pro <- are_task$clone()
are_gemini_2_5_pro$eval(solver_chat = gemini_2_5_pro)
Next, we’ll evaluate GPT o4-mini:
are_gpt_o4_mini <- are_task$clone()
are_gpt_o4_mini$eval(solver_chat = gpt_o4_mini)
Finally, let’s evaluate Claude 3.7 Sonnet:
are_claude_3_7 <- are_task$clone()
are_claude_3_7$eval(solver_chat = claude_sonnet_3_7)
The interactive viewer will allow us to inspect the evaluation in detail:
While the total durations of the evaluations are correct in the viewer, the timings of specific samples are now off. Given some changes in downstream packages, vitals has to estimate how long a given request takes rather than receiving the exact duration; this will be resolved down the line.
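If you’re running these evals yourself, the viewer can be opened from the console once the evaluations above have logged their results; with the default log directory, that’s something like:

# open the interactive viewer for the logged evaluations
vitals_view()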
Analysis
Let’s combine the results of all evaluations to compare the models:
are_eval <-
  vitals_bind(
    `Gemini 2.5 Flash (Thinking)` = are_gemini_2_5_flash_thinking,
    `Gemini 2.5 Flash (Non-thinking)` = are_gemini_2_5_flash_non_thinking,
    `Gemini 2.0 Flash` = are_gemini_2_0_flash,
    `Gemini 2.5 Pro` = are_gemini_2_5_pro,
    `GPT o4-mini` = are_gpt_o4_mini,
    `Claude Sonnet 3.7` = are_claude_3_7
  ) %>%
  rename(model = task) %>%
  mutate(
    model = factor(model, levels = c(
      "Gemini 2.5 Flash (Thinking)",
      "Gemini 2.5 Flash (Non-thinking)",
      "Gemini 2.0 Flash",
      "Gemini 2.5 Pro",
      "GPT o4-mini",
      "Claude Sonnet 3.7"
    ))
  )
are_eval
# A tibble: 468 × 5
model id epoch score metadata
<fct> <chr> <int> <ord> <list>
1 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 1 I <tibble>
2 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 2 I <tibble>
3 Gemini 2.5 Flash (Thinking) after-stat-bar-he… 3 I <tibble>
4 Gemini 2.5 Flash (Thinking) conditional-group… 1 P <tibble>
5 Gemini 2.5 Flash (Thinking) conditional-group… 2 C <tibble>
6 Gemini 2.5 Flash (Thinking) conditional-group… 3 C <tibble>
7 Gemini 2.5 Flash (Thinking) correlated-delays… 1 P <tibble>
8 Gemini 2.5 Flash (Thinking) correlated-delays… 2 P <tibble>
9 Gemini 2.5 Flash (Thinking) correlated-delays… 3 P <tibble>
10 Gemini 2.5 Flash (Thinking) curl-http-get 1 C <tibble>
# ℹ 458 more rows
Let’s visualize the results with a bar chart:
are_eval %>%
  mutate(
    score = fct_recode(
      score,
      "Correct" = "C", "Partially Correct" = "P", "Incorrect" = "I"
    )
  ) %>%
  ggplot(aes(y = model, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(
    breaks = rev,
    values = c(
      "Correct" = "#67a9cf",
      "Partially Correct" = "#f6e8c3",
      "Incorrect" = "#ef8a62"
    )
  ) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Percent", y = "Model",
    title = "An R Eval",
    subtitle = "The Gemini 2.5 Flash models represent a middle-ground between 2.5 Pro and\no4-mini, both in terms of price and performance."
  ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    legend.position = "bottom"
  )
To determine if the differences we’re seeing are statistically significant, we’ll use a cumulative link mixed model:
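The code that fits this model isn’t shown in the post body; reconstructing it from the summary output below, it was presumably fitted with clmm() from the ordinal package, along these lines:

library(ordinal)

# cumulative link mixed model: graded score as a function of
# model, with a random intercept for each question id
are_mod <- clmm(score ~ model + (1 | id), data = are_eval)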
summary(are_mod)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: score ~ model + (1 | id)
data: are_eval
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 468 -375.15 766.30 397(1971) 8.31e-05 6.4e+01
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 5.523 2.35
Number of groups: id 26
Coefficients:
Estimate Std. Error z value
modelGemini 2.5 Flash (Non-thinking) -0.8506 0.3604 -2.360
modelGemini 2.0 Flash -1.3338 0.3698 -3.607
modelGemini 2.5 Pro 0.6600 0.3611 1.828
modelGPT o4-mini 0.7887 0.3620 2.179
modelClaude Sonnet 3.7 0.3740 0.3567 1.048
Pr(>|z|)
modelGemini 2.5 Flash (Non-thinking) 0.01826 *
modelGemini 2.0 Flash 0.00031 ***
modelGemini 2.5 Pro 0.06759 .
modelGPT o4-mini 0.02934 *
modelClaude Sonnet 3.7 0.29441
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
I|P -1.3191 0.5404 -2.441
P|C 0.4820 0.5361 0.899
For the purposes of this post, we’ll just take a look at the Coefficients table. The reference model here is Gemini 2.5 Flash (Thinking). Negative coefficient estimates for a given model indicate that model is less likely to receive higher ratings than Gemini 2.5 Flash (Thinking); positive estimates indicate the opposite. Looking at the coefficients:
- Gemini 2.5 Flash is a marked improvement over its previous generation on this eval.
- Gemini 2.5 Flash significantly lags behind o4-mini on this eval.
- Enabling Gemini 2.5 Flash’s thinking results in a marked increase in performance over the non-thinking model, though it brings the pricing much closer to the more performant o4-mini.
- Gemini 2.5 Flash with thinking disabled is nearly indistinguishable from GPT 4.1-nano on this eval. 2.5 Flash reached an accuracy of 43.6% at a cost of $0.15/m input, $0.60/m output, while 4.1-nano scored 44.2% at a cost of $0.10/m input, $0.40/m output. 4.1-nano continues to pack the greatest punch at the budget, non-thinking price point.
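As an aside, the accuracy figures in that last bullet collapse the graded scores to a single number. Assuming partially correct answers count for half credit (my assumption about how these figures are computed), a quick way to tabulate accuracy by model from are_eval looks like:

are_eval %>%
  mutate(
    # C = full credit, P = half credit, I = no credit
    score_num = case_when(
      score == "C" ~ 1,
      score == "P" ~ 0.5,
      score == "I" ~ 0
    )
  ) %>%
  summarize(accuracy = mean(score_num), .by = model)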
One more note before I wrap up: For the past month or two, development on ellmer and vitals has been quite coupled to support answering common questions about LLM performance. With the release of ellmer 0.2.0 on CRAN last week, I’m starting to gear up for an initial CRAN release of vitals here soon. In the meantime, I’m especially interested in feedback from folks who have given the package a go! Do let me know if you give it a whir and run into any hiccups.
Thank you to Max Kuhn for advising on the model-based analysis here.
In a first for this blog, I tried using a model to help me write this post. In general, I don’t tend to use models to help with writing at all. Now that I’ve written a good few of these posts to pattern-match from, I wondered if Claude 3.7 Sonnet could draft a reasonable starting place. I used this prompt; as usual, I ended up deleting all of the prose that the model wrote, but it was certainly a boost to have all of the code written for me.