R-Bench

R-Bench expands on the classic strawberry problem and focuses on extra logic and recall problems for LLMs centered around the letter R. Counting the number of letter R's in words has proven tricky for many LLMs, especially non-reasoning models. This test takes that to a new level by including extra scenarios related to counting letters in strings.

Generated on 2025-04-14 18:19:46

Models Evaluated: 18
Total Problems: 21
Grader Model: openai/gpt-4o
Best Performer: openai/o3-mini
Test Time: 1985.39s
Grade Time: 834.26s
Total Tokens: 258,565 (Input: 223,190 | Output: 35,375)
Total Cost: $1.3153

Model Performance

Model | Pass Rate | Token Usage | Cost
openai/o3-mini | 21/21 (100.0%) | 15,184 (in: 13,541, out: 1,643) | $0.1237 (Solving: $0.0735, Grading: $0.0503)
openai/o1-mini | 21/21 (100.0%) | 15,557 (in: 13,849, out: 1,708) | $0.2269 (Solving: $0.1752, Grading: $0.0517)
openai/gpt-4.1 | 18/21 (85.7%) | 15,730 (in: 13,937, out: 1,793) | $0.0890 (Solving: $0.0363, Grading: $0.0528)
openai/gpt-4.1-mini | 18/21 (85.7%) | 14,440 (in: 12,640, out: 1,800) | $0.0548 (Solving: $0.0052, Grading: $0.0496)
anthropic/claude-3-7-sonnet-latest | 17/21 (81.0%) | 14,201 (in: 12,402, out: 1,799) | $0.0981 (Solving: $0.0491, Grading: $0.0490)
openai/gpt-4o-mini | 16/21 (76.2%) | 13,894 (in: 12,108, out: 1,786) | $0.0498 (Solving: $0.0016, Grading: $0.0481)
openai/gpt-4o | 14/21 (66.7%) | 14,158 (in: 12,144, out: 2,014) | $0.0779 (Solving: $0.0274, Grading: $0.0505)
openai/gpt-4.1-nano | 12/21 (57.1%) | 14,132 (in: 12,289, out: 1,843) | $0.0503 (Solving: $0.0012, Grading: $0.0492)
anthropic/claude-3-5-sonnet-latest | 11/21 (52.4%) | 13,251 (in: 11,396, out: 1,855) | $0.0795 (Solving: $0.0324, Grading: $0.0470)
ollama/gemma2:latest | 9/21 (42.9%) | 13,594 (in: 11,604, out: 1,990) | $0.0489 (Solving: $0.0000, Grading: $0.0489)
ollama/phi4:latest | 8/21 (38.1%) | 15,630 (in: 13,355, out: 2,275) | $0.0561 (Solving: $0.0000, Grading: $0.0561)
ollama/llama3.2:latest | 8/21 (38.1%) | 13,755 (in: 11,642, out: 2,113) | $0.0502 (Solving: $0.0000, Grading: $0.0502)
ollama/gemma3:12b | 7/21 (33.3%) | 13,849 (in: 11,901, out: 1,948) | $0.0492 (Solving: $0.0000, Grading: $0.0492)
openai/gpt-3.5-turbo | 5/21 (23.8%) | 12,474 (in: 10,516, out: 1,958) | $0.0477 (Solving: $0.0018, Grading: $0.0459)
ollama/cogito:8b | 4/21 (19.0%) | 13,644 (in: 11,444, out: 2,200) | $0.0506 (Solving: $0.0000, Grading: $0.0506)
ollama/phi3.5:latest | 3/21 (14.3%) | 16,739 (in: 14,320, out: 2,419) | $0.0600 (Solving: $0.0000, Grading: $0.0600)
ollama/qwen2.5-coder:3b | 3/21 (14.3%) | 14,182 (in: 12,056, out: 2,126) | $0.0514 (Solving: $0.0000, Grading: $0.0514)
ollama/cogito:3b | 2/21 (9.5%) | 14,151 (in: 12,046, out: 2,105) | $0.0512 (Solving: $0.0000, Grading: $0.0512)
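The Pass Rate column pairs a raw count with a percentage rounded to one decimal place. A minimal Python sketch of that arithmetic follows; the helper name `pass_rate` is hypothetical and is not taken from the tool that generated this report.

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a pass count in the same shape as the table, e.g. '18/21 (85.7%)'."""
    return f"{passed}/{total} ({passed / total:.1%})"


print(pass_rate(18, 21))  # 18/21 (85.7%)
```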

Problem Details

Index Problem Name
1
Rs - Strawberry

Problem Text:

how many letter R's are in the word strawberry?

Grading Rules:

must state the correct answer

Correct Answer:

3
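The expected answer can be checked mechanically. This is a minimal Python sketch, assuming a case-insensitive count; the helper name `count_r` is illustrative and is not part of the benchmark.

```python
def count_r(text: str) -> int:
    """Count occurrences of the letter R in text, ignoring case."""
    return text.lower().count("r")


print(count_r("strawberry"))  # 3
```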

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 0.98
1 1.0 ✓ 0.93
2 1.0 ✓ 1.30
3 1.0 ✓ 1.18
4 1.0 ✓ 0.93
ollama/gemma2:latest ✗ 0 0.0 ✗ 1.56
1 0.0 ✗ 1.64
2 0.0 ✗ 1.49
3 0.0 ✗ 1.49
4 0.0 ✗ 1.11
ollama/phi4:latest ✗ 0 0.0 ✗ 1.29
1 0.0 ✗ 1.37
2 0.0 ✗ 1.53
3 0.0 ✗ 1.44
4 0.0 ✗ 1.43
ollama/phi3.5:latest ✓ 0 1.0 ✓ 1.56
1 0.0 ✗ 2.37
2 0.0 ✗ 1.50
3 1.0 ✓ 1.20
4 1.0 ✓ 2.49
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.48
1 1.0 ✓ 1.67
2 1.0 ✓ 1.27
3 1.0 ✓ 1.23
4 1.0 ✓ 1.57
openai/gpt-4.1-mini ✗ 0 0.0 ✗ 1.32
1 0.0 ✗ 1.53
2 0.0 ✗ 1.49
3 0.0 ✗ 6.99
4 0.0 ✗ 1.99
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.22
1 0.0 ✗ 1.47
2 0.0 ✗ 1.17
3 0.0 ✗ 1.26
4 0.0 ✗ 3.01
openai/gpt-4o ✓ 0 0.0 ✗ 1.65
1 1.0 ✓ 1.37
2 1.0 ✓ 1.21
3 0.0 ✗ 1.16
4 0.0 ✗ 1.22
openai/gpt-4o-mini ✓ 0 1.0 ✓ 2.02
1 1.0 ✓ 1.32
2 0.0 ✗ 1.73
3 1.0 ✓ 1.00
4 1.0 ✓ 1.11
openai/gpt-3.5-turbo ✓ 0 0.0 ✗ 1.14
1 0.0 ✗ 1.64
2 0.0 ✗ 1.62
3 1.0 ✓ 1.35
4 1.0 ✓ 1.47
openai/o3-mini ✓ 0 1.0 ✓ 1.11
1 1.0 ✓ 0.94
2 1.0 ✓ 1.79
3 1.0 ✓ 1.16
4 1.0 ✓ 1.16
openai/o1-mini ✓ 0 1.0 ✓ 1.05
1 1.0 ✓ 1.26
2 0.0 ✗ 1.24
3 1.0 ✓ 1.31
4 1.0 ✓ 1.49
anthropic/claude-3-7-sonnet-latest ✗ 0 0.0 ✗ 1.41
1 0.0 ✗ 1.28
2 0.0 ✗ 1.64
3 0.0 ✗ 1.66
4 0.0 ✗ 1.18
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.39
1 0.0 ✗ 1.53
2 0.0 ✗ 1.68
3 0.0 ✗ 2.26
4 0.0 ✗ 1.37
ollama/cogito:8b ✓ 0 0.0 ✗ 2.58
1 1.0 ✓ 1.52
2 0.0 ✗ 1.56
3 0.0 ✗ 1.15
4 1.0 ✓ 1.08
ollama/cogito:3b ✗ 0 0.0 ✗ 1.79
1 0.0 ✗ 1.43
2 0.0 ✗ 1.05
3 0.0 ✗ 1.68
4 0.0 ✗ 1.84
ollama/qwen2.5-coder:3b ✓ 0 1.0 ✓ 1.48
1 0.0 ✗ 1.88
2 0.0 ✗ 1.03
3 0.0 ✗ 1.69
4 0.0 ✗ 1.86
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.10
1 0.0 ✗ 1.37
2 0.0 ✗ 1.21
3 0.0 ✗ 1.38
4 0.0 ✗ 1.25
2
Rs - Complex

Problem Text:

how many letter R's are in the sentence "ravens roost on a berry colored rangerover"?

Grading Rules:

must state the correct answer

Correct Answer:

8

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.21
ollama/gemma2:latest ✗ 0 0.0 ✗ 1.37
ollama/phi4:latest ✗ 0 0.0 ✗ 2.13
ollama/phi3.5:latest ✗ 0 0.0 ✗ 1.40
openai/gpt-4.1 ✗ 0 0.0 ✗ 1.66
openai/gpt-4.1-mini ✗ 0 0.0 ✗ 1.50
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.54
openai/gpt-4o ✓ 0 1.0 ✓ 1.10
openai/gpt-4o-mini ✗ 0 0.0 ✗ 0.96
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.32
openai/o3-mini ✓ 0 1.0 ✓ 2.08
openai/o1-mini ✓ 0 1.0 ✓ 1.75
anthropic/claude-3-7-sonnet-latest ✗ 0 0.0 ✗ 1.32
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.23
ollama/cogito:8b ✗ 0 0.0 ✗ 1.71
ollama/cogito:3b ✗ 0 0.0 ✗ 1.40
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.51
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.11
3
Count Rs - Ignore Last Word

Problem Text:

Ignoring the last word, how many letter R's are in the phrase "rural road runner" ignoring vowels?

Grading Rules:

must correctly ignore vowels and count R's

Correct Answer:

3
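The prompt stacks two conditions: drop the last word, and ignore vowels. Removing vowels cannot change how many R's remain, so only the first condition affects the count. A small Python sketch of that reading (the function name is hypothetical):

```python
def count_r_ignoring_last_word(phrase: str) -> int:
    """Drop the last word, strip vowels, then count the remaining R's."""
    words = phrase.lower().split()[:-1]
    consonants = "".join(ch for w in words for ch in w if ch not in "aeiou")
    return consonants.count("r")


print(count_r_ignoring_last_word("rural road runner"))  # 3
```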

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 1.36
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.78
ollama/phi4:latest ✗ 0 0.0 ✗ 2.65
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.43
openai/gpt-4.1 ✓ 0 1.0 ✓ 5.71
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.41
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.64
openai/gpt-4o ✗ 0 0.0 ✗ 2.96
openai/gpt-4o-mini ✗ 0 0.0 ✗ 2.42
openai/gpt-3.5-turbo ✓ 0 1.0 ✓ 1.28
openai/o3-mini ✓ 0 1.0 ✓ 1.62
openai/o1-mini ✓ 0 1.0 ✓ 1.55
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.91
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.10
ollama/cogito:8b ✓ 0 1.0 ✓ 1.14
ollama/cogito:3b ✗ 0 0.0 ✗ 1.39
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.50
ollama/llama3.2:latest ✗ 0 0.0 ✗ 6.12
4
Count Rs - Ignore first word

Problem Text:

Ignoring the first word, how many letter R's are in the phrase "red roses rarely ripen right"?

Grading Rules:

must correctly ignore first word

Correct Answer:

5

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.29
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.76
ollama/phi4:latest ✓ 0 1.0 ✓ 1.67
ollama/phi3.5:latest ✗ 0 0.0 ✗ 4.97
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.18
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.35
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.63
openai/gpt-4o ✗ 0 0.0 ✗ 2.10
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.63
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.38
openai/o3-mini ✓ 0 1.0 ✓ 1.50
openai/o1-mini ✓ 0 1.0 ✓ 1.51
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.57
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.28
ollama/cogito:8b ✗ 0 0.0 ✗ 1.78
ollama/cogito:3b ✗ 0 0.0 ✗ 1.61
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 0.93
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.83
5
Count Rs - Only first and last word

Problem Text:

Counting letters in only the first and last words, how many R's are in "royal ravens rarely roam"?

Grading Rules:

must correctly count only first and last words

Correct Answer:

2

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 1.51
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.50
ollama/phi4:latest ✗ 0 0.0 ✗ 2.18
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.45
openai/gpt-4.1 ✓ 0 1.0 ✓ 3.59
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.99
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.65
openai/gpt-4o ✓ 0 1.0 ✓ 1.69
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.54
openai/gpt-3.5-turbo ✓ 0 1.0 ✓ 1.47
openai/o3-mini ✓ 0 1.0 ✓ 1.64
openai/o1-mini ✓ 0 1.0 ✓ 1.10
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.69
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.49
ollama/cogito:8b ✗ 0 0.0 ✗ 1.88
ollama/cogito:3b ✓ 0 1.0 ✓ 1.65
ollama/qwen2.5-coder:3b ✓ 0 1.0 ✓ 1.24
ollama/llama3.2:latest ✓ 0 1.0 ✓ 2.08
6
Count Rs - Capital letters only

Problem Text:

How many uppercase letter R's are in "Rural Roaming Rancho"?

Grading Rules:

must count only uppercase letters

Correct Answer:

3

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.36
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.63
ollama/phi4:latest ✓ 0 1.0 ✓ 1.48
ollama/phi3.5:latest ✗ 0 0.0 ✗ 3.02
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.81
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.10
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.56
openai/gpt-4o ✗ 0 0.0 ✗ 1.55
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.39
openai/gpt-3.5-turbo ✓ 0 1.0 ✓ 0.95
openai/o3-mini ✓ 0 1.0 ✓ 6.04
openai/o1-mini ✓ 0 1.0 ✓ 1.24
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.34
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.00
ollama/cogito:8b ✓ 0 1.0 ✓ 1.39
ollama/cogito:3b ✗ 0 0.0 ✗ 1.38
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.02
ollama/llama3.2:latest ✗ 0 0.0 ✗ 6.62
7
Count Rs - Exclude repeated words

Problem Text:

Excluding duplicate words, how many R's are in "rare rare rabbits run run rapidly"?

Grading Rules:

must count only unique words

Correct Answer:

5
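One way to read "excluding duplicate words" is to keep each distinct word once and count R's across that set. The Python sketch below follows that interpretation; the function name is hypothetical.

```python
def count_r_unique_words(phrase: str) -> int:
    """Count R's across distinct words, keeping one copy of each repeated word."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_words = dict.fromkeys(phrase.lower().split())
    return sum(word.count("r") for word in unique_words)


print(count_r_unique_words("rare rare rabbits run run rapidly"))  # 5
```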

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 3.28
ollama/gemma2:latest ✗ 0 0.0 ✗ 4.30
ollama/phi4:latest ✓ 0 1.0 ✓ 1.94
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.75
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.32
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.58
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.98
openai/gpt-4o ✓ 0 1.0 ✓ 1.44
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.25
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.99
openai/o3-mini ✓ 0 1.0 ✓ 1.26
openai/o1-mini ✓ 0 1.0 ✓ 2.20
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 5.66
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 2.52
ollama/cogito:8b ✗ 0 0.0 ✗ 1.99
ollama/cogito:3b ✗ 0 0.0 ✗ 2.84
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.71
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.60
8
Count Rs - Ignore letters after S

Problem Text:

Ignoring letters in words after the first S appears in the word, how many R's are in "restored servers start readily"?

Grading Rules:

must correctly ignore letters after S

Correct Answer:

2
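The rule applies per word: once the first S appears, everything after it in that word is ignored. A minimal Python sketch under that reading (the function name is an assumption made for illustration):

```python
def count_r_before_first_s(phrase: str) -> int:
    """Count R's in each word, looking only at letters before the word's first S."""
    total = 0
    for word in phrase.lower().split():
        # split("s", 1)[0] is the whole word when it contains no S.
        prefix = word.split("s", 1)[0]
        total += prefix.count("r")
    return total


print(count_r_before_first_s("restored servers start readily"))  # 2
```

Here "restored" contributes 1 (from "re"), "servers" and "start" contribute 0 because they begin with S, and "readily" contributes 1.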

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.92
ollama/gemma2:latest ✓ 0 1.0 ✓ 2.21
ollama/phi4:latest ✓ 0 1.0 ✓ 1.30
ollama/phi3.5:latest ✗ 0 0.0 ✗ 4.79
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.74
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 2.00
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 2.12
openai/gpt-4o ✓ 0 1.0 ✓ 1.71
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.43
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 2.18
openai/o3-mini ✓ 0 1.0 ✓ 1.61
openai/o1-mini ✓ 0 1.0 ✓ 1.75
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.27
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 2.48
ollama/cogito:8b ✗ 0 0.0 ✗ 2.72
ollama/cogito:3b ✗ 0 0.0 ✗ 1.96
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.96
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.61
9
Count Rs - Only middle word

Problem Text:

Only looking at the middle word in the phrase, how many R's are there in "rapid rain rarely"?

Grading Rules:

must correctly identify middle word

Correct Answer:

1

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.64
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.35
ollama/phi4:latest ✓ 0 1.0 ✓ 1.28
ollama/phi3.5:latest ✗ 0 0.0 ✗ 1.70
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.31
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.26
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.15
openai/gpt-4o ✓ 0 1.0 ✓ 1.23
openai/gpt-4o-mini ✗ 0 0.0 ✗ 1.21
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.21
openai/o3-mini ✓ 0 1.0 ✓ 1.26
openai/o1-mini ✓ 0 1.0 ✓ 1.33
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.49
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.36
ollama/cogito:8b ✗ 0 0.0 ✗ 1.70
ollama/cogito:3b ✗ 0 0.0 ✗ 1.46
ollama/qwen2.5-coder:3b ✓ 0 1.0 ✓ 0.97
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.30
10
Count Rs next to number characters

Problem Text:

Count all the letter R's in the following phrase where the R is next to a number character: "1R1RRRPPPR5"?

Grading Rules:

must correctly ignore numbers

Correct Answer:

3
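An R qualifies here when the character immediately before or after it is a digit. The following Python sketch checks each position that way; the function name is hypothetical.

```python
def count_r_next_to_digit(s: str) -> int:
    """Count R's that have a digit as their left or right neighbour."""
    count = 0
    for i, ch in enumerate(s):
        if ch.lower() != "r":
            continue
        left_is_digit = i > 0 and s[i - 1].isdigit()
        right_is_digit = i + 1 < len(s) and s[i + 1].isdigit()
        if left_is_digit or right_is_digit:
            count += 1
    return count


print(count_r_next_to_digit("1R1RRRPPPR5"))  # 3
```

The three qualifying R's are the one between the two 1s, the one directly after the second 1, and the one directly before the 5.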

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 3.81
ollama/gemma2:latest ✗ 0 0.0 ✗ 1.98
ollama/phi4:latest ✓ 0 1.0 ✓ 2.33
ollama/phi3.5:latest ✓ 0 1.0 ✓ 1.64
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.33
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.55
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.36
openai/gpt-4o ✓ 0 1.0 ✓ 1.32
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.41
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 2.82
openai/o3-mini ✓ 0 1.0 ✓ 0.99
openai/o1-mini ✓ 0 1.0 ✓ 1.45
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 2.44
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.91
ollama/cogito:8b ✗ 0 0.0 ✗ 1.94
ollama/cogito:3b ✗ 0 0.0 ✗ 2.05
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 2.03
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.27
11
Count Rs in the Longest Word

Problem Text:

Count the letter R's in the longest word in the phrase: "Rapid runner races"?

Grading Rules:

must correctly count the letters

Correct Answer:

2
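This task has two steps: pick the longest word, then count its R's. A short Python sketch with a hypothetical function name; on a length tie, `max` returns the first longest word, which is an assumption the problem does not address.

```python
def count_r_longest_word(phrase: str) -> int:
    """Count R's in the longest word of the phrase (first one wins on ties)."""
    words = phrase.lower().split()
    longest = max(words, key=len)
    return longest.count("r")


print(count_r_longest_word("Rapid runner races"))  # 2
```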

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 7.82
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.49
ollama/phi4:latest ✓ 0 1.0 ✓ 2.19
ollama/phi3.5:latest ✓ 0 1.0 ✓ 1.45
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.35
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.13
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.45
openai/gpt-4o ✗ 0 0.0 ✗ 1.33
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.63
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.37
openai/o3-mini ✓ 0 1.0 ✓ 1.21
openai/o1-mini ✓ 0 1.0 ✓ 1.12
anthropic/claude-3-7-sonnet-latest ✗ 0 0.0 ✗ 5.30
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.31
ollama/cogito:8b ✗ 0 0.0 ✗ 2.28
ollama/cogito:3b ✗ 0 0.0 ✗ 2.71
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 2.37
ollama/llama3.2:latest ✓ 0 1.0 ✓ 1.65
12
Count Rs - No R's Exist

Problem Text:

Count the letter R's in the phrase: "This Sentance is a Fancy Message"?

Grading Rules:

must correctly count the letters

Correct Answer:

0

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 8.48
ollama/gemma2:latest ✗ 0 0.0 ✗ 1.65
ollama/phi4:latest ✗ 0 0.0 ✗ 1.90
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.06
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.19
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 0.95
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.04
openai/gpt-4o ✓ 0 1.0 ✓ 1.26
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.65
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.48
openai/o3-mini ✓ 0 1.0 ✓ 0.92
openai/o1-mini ✓ 0 1.0 ✓ 1.24
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.39
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.48
ollama/cogito:8b ✗ 0 0.0 ✗ 1.74
ollama/cogito:3b ✗ 0 0.0 ✗ 1.14
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.64
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.59
13
Count Rs - No R's Exist

Problem Text:

Count the letter R's in the phrase: "This Sentance is a Fancy Message"?

Grading Rules:

must correctly count the letters

Correct Answer:

0

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 2.10
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.44
ollama/phi4:latest ✗ 0 0.0 ✗ 2.02
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.46
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.40
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.30
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 5.12
openai/gpt-4o ✓ 0 1.0 ✓ 1.35
openai/gpt-4o-mini ✓ 0 1.0 ✓ 0.94
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.73
openai/o3-mini ✓ 0 1.0 ✓ 1.07
openai/o1-mini ✓ 0 1.0 ✓ 1.26
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.31
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.25
ollama/cogito:8b ✗ 0 0.0 ✗ 1.11
ollama/cogito:3b ✗ 0 0.0 ✗ 1.22
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.25
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.29
14
Count Rs followed by vowels

Problem Text:

How many letter R's in the sentence "Ravens roost on a berry colored rangerover" are immediately followed by a vowel (a, e, i, o, u)?

Grading Rules:

must correctly count only the R's followed by vowels

Correct Answer:

5
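A regular-expression lookahead expresses "immediately followed by a vowel" directly. The sketch below uses Python's standard `re` module; the function name is hypothetical, and Y is not treated as a vowel, matching the list given in the prompt.

```python
import re


def count_r_followed_by_vowel(sentence: str) -> int:
    """Count R's whose next character is one of a, e, i, o, u."""
    # (?=[aeiou]) is a lookahead: it checks the next character without consuming it.
    return len(re.findall(r"r(?=[aeiou])", sentence.lower()))


print(count_r_followed_by_vowel("Ravens roost on a berry colored rangerover"))  # 5
```

Both R's in "berry" are excluded (followed by R and Y), as is the final R of "rangerover".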

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 2.82
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.89
ollama/phi4:latest ✗ 0 0.0 ✗ 2.34
ollama/phi3.5:latest ✗ 0 0.0 ✗ 3.20
openai/gpt-4.1 ✗ 0 0.0 ✗ 2.12
openai/gpt-4.1-mini ✗ 0 0.0 ✗ 3.70
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.72
openai/gpt-4o ✓ 0 1.0 ✓ 1.86
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.90
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.56
openai/o3-mini ✓ 0 1.0 ✓ 1.56
openai/o1-mini ✓ 0 1.0 ✓ 2.35
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.77
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 2.15
ollama/cogito:8b ✗ 0 0.0 ✗ 3.26
ollama/cogito:3b ✗ 0 0.0 ✗ 1.71
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.78
ollama/llama3.2:latest ✗ 0 0.0 ✗ 3.30
15
Count internal Rs

Problem Text:

How many letter R's in the word "rearrange" are not at the beginning or the end of the word?

Grading Rules:

must correctly count only the R's that are not the first or last letter

Correct Answer:

2
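Dropping the first and last characters and counting what remains gives the internal R's. A minimal Python sketch (the function name is an assumption):

```python
def count_internal_r(word: str) -> int:
    """Count R's that are neither the first nor the last letter of the word."""
    return word.lower()[1:-1].count("r")


print(count_internal_r("rearrange"))  # 2
```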

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 2.66
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.09
ollama/phi4:latest ✗ 0 0.0 ✗ 1.97
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.45
openai/gpt-4.1 ✓ 0 1.0 ✓ 2.26
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 2.24
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 2.86
openai/gpt-4o ✗ 0 0.0 ✗ 2.37
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.75
openai/gpt-3.5-turbo ✓ 0 1.0 ✓ 1.37
openai/o3-mini ✓ 0 1.0 ✓ 1.13
openai/o1-mini ✓ 0 1.0 ✓ 1.13
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.16
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 2.00
ollama/cogito:8b ✗ 0 0.0 ✗ 1.68
ollama/cogito:3b ✓ 0 1.0 ✓ 1.45
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.64
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.52
16
Count Rs in every second word

Problem Text:

In the sentence "Red roses are rarely red," count the number of letter R's in every second word, starting from the first word.

Grading Rules:

must correctly identify every second word and count the R's in those

Correct Answer:

3
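"Every second word, starting from the first" selects the words at positions 1, 3, 5, which a slice with step 2 captures. A Python sketch with a hypothetical function name:

```python
def count_r_every_second_word(sentence: str) -> int:
    """Count R's in the 1st, 3rd, 5th, ... words of the sentence."""
    words = sentence.lower().split()
    # words[::2] keeps indices 0, 2, 4, ...; trailing punctuation does not affect the R count.
    return sum(word.count("r") for word in words[::2])


print(count_r_every_second_word("Red roses are rarely red,"))  # 3
```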

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 2.85
ollama/gemma2:latest ✗ 0 0.0 ✗ 1.68
ollama/phi4:latest ✗ 0 0.0 ✗ 2.06
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.41
openai/gpt-4.1 ✗ 0 0.0 ✗ 1.75
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.58
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.77
openai/gpt-4o ✗ 0 0.0 ✗ 4.07
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.44
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 3.24
openai/o3-mini ✓ 0 1.0 ✓ 1.70
openai/o1-mini ✓ 0 1.0 ✓ 1.64
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 2.21
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.02
ollama/cogito:8b ✗ 0 0.0 ✗ 1.80
ollama/cogito:3b ✗ 0 0.0 ✗ 1.73
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 2.15
ollama/llama3.2:latest ✓ 0 1.0 ✓ 2.04
17
Count Rs in words with S

Problem Text:

In the sentence "Ravens and sparrows share resources," count the number of letter R's in words that also contain the letter S.

Grading Rules:

must correctly identify words containing S and count the R's in those words

Correct Answer:

6
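The filter here is per word: only words containing an S contribute their R's. The Python sketch below applies that filter case-insensitively; the function name is hypothetical.

```python
def count_r_in_words_with_s(sentence: str) -> int:
    """Count R's in words that also contain the letter S."""
    words = sentence.lower().split()
    return sum(word.count("r") for word in words if "s" in word)


print(count_r_in_words_with_s("Ravens and sparrows share resources,"))  # 6
```

"Ravens" (1), "sparrows" (2), "share" (1), and "resources" (2) qualify; "and" has no S.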

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 2.54
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.57
ollama/phi4:latest ✗ 0 0.0 ✗ 1.51
ollama/phi3.5:latest ✗ 0 0.0 ✗ 1.98
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.63
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 2.51
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.43
openai/gpt-4o ✓ 0 1.0 ✓ 2.29
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.64
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.71
openai/o3-mini ✓ 0 1.0 ✓ 1.52
openai/o1-mini ✓ 0 1.0 ✓ 2.83
anthropic/claude-3-7-sonnet-latest ✗ 0 0.0 ✗ 2.37
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 2.35
ollama/cogito:8b ✗ 0 0.0 ✗ 2.90
ollama/cogito:3b ✗ 0 0.0 ✗ 2.16
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.77
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.83
18
Count Rs in repeated words

Problem Text:

In the sentence "Rare rabbits run rapidly, but rare rabbits rest rarely," count the number of letter R's in unique words that appear more than once.

Grading Rules:

must identify unique words that appear more than once and count the R's in those words

Correct Answer:

3
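The expected answer treats each repeated word once: "rare" and "rabbits" each appear twice, and their R's are counted a single time. A Python sketch of that interpretation using `collections.Counter` (the function name is hypothetical):

```python
import string
from collections import Counter


def count_r_in_repeated_words(sentence: str) -> int:
    """Count R's once per distinct word that occurs more than once."""
    # Strip surrounding punctuation so "rapidly," and "rapidly" compare equal.
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    occurrences = Counter(words)
    return sum(word.count("r") for word, n in occurrences.items() if n > 1)


print(count_r_in_repeated_words(
    "Rare rabbits run rapidly, but rare rabbits rest rarely,"
))  # 3
```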

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 1.92
ollama/gemma2:latest ✓ 0 1.0 ✓ 4.92
ollama/phi4:latest ✗ 0 0.0 ✗ 4.62
ollama/phi3.5:latest ✗ 0 0.0 ✗ 1.86
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.90
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.63
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.57
openai/gpt-4o ✓ 0 1.0 ✓ 1.28
openai/gpt-4o-mini ✗ 0 0.0 ✗ 1.52
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.21
openai/o3-mini ✓ 0 1.0 ✓ 1.43
openai/o1-mini ✓ 0 1.0 ✓ 2.42
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.47
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 3.12
ollama/cogito:8b ✗ 0 0.0 ✗ 1.98
ollama/cogito:3b ✗ 0 0.0 ✗ 1.72
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 2.10
ollama/llama3.2:latest ✗ 0 0.0 ✗ 2.63
19
Count Rs in even-length words

Problem Text:

In the sentence "river rowers raft happily" count the number of letter R's in words that have an even number of letters.

Grading Rules:

must correctly identify words with even number of letters and count the R's in those

Correct Answer:

3
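Word length parity decides which words count: "rowers" (6 letters) and "raft" (4) are even, while "river" (5) and "happily" (7) are odd. A short Python sketch with a hypothetical function name:

```python
def count_r_even_length_words(sentence: str) -> int:
    """Count R's in words whose length is an even number of letters."""
    words = sentence.lower().split()
    return sum(word.count("r") for word in words if len(word) % 2 == 0)


print(count_r_even_length_words("river rowers raft happily"))  # 3
```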

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✓ 0 1.0 ✓ 1.42
ollama/gemma2:latest ✓ 0 1.0 ✓ 1.43
ollama/phi4:latest ✓ 0 1.0 ✓ 1.35
ollama/phi3.5:latest ✗ 0 0.0 ✗ 2.98
openai/gpt-4.1 ✓ 0 1.0 ✓ 2.05
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 2.62
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.07
openai/gpt-4o ✓ 0 1.0 ✓ 1.69
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.60
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.54
openai/o3-mini ✓ 0 1.0 ✓ 1.21
openai/o1-mini ✓ 0 1.0 ✓ 1.73
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.59
anthropic/claude-3-5-sonnet-latest ✓ 0 1.0 ✓ 1.19
ollama/cogito:8b ✓ 0 1.0 ✓ 2.59
ollama/cogito:3b ✗ 0 0.0 ✗ 1.43
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 2.65
ollama/llama3.2:latest ✗ 0 0.0 ✗ 2.76
20
Count Rs in words starting with a vowel

Problem Text:

In the sentence "Eagles and ostriches are birds," count the number of letter R's in words that start with a vowel.

Grading Rules:

must correctly identify words starting with a vowel and count the R's in those

Correct Answer:

2
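Only words whose first letter is a vowel contribute here, so "birds" is skipped even though it contains an R. A minimal Python sketch, assuming a, e, i, o, u as the vowel set; the function name is hypothetical.

```python
import string


def count_r_vowel_initial_words(sentence: str) -> int:
    """Count R's in words whose first letter is a, e, i, o, or u."""
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    return sum(w.count("r") for w in words if w and w[0] in "aeiou")


print(count_r_vowel_initial_words("Eagles and ostriches are birds,"))  # 2
```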

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 2.23
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.45
ollama/phi4:latest ✗ 0 0.0 ✗ 2.06
ollama/phi3.5:latest ✗ 0 0.0 ✗ 5.08
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.56
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.64
openai/gpt-4.1-nano ✓ 0 1.0 ✓ 1.89
openai/gpt-4o ✓ 0 1.0 ✓ 1.31
openai/gpt-4o-mini ✗ 0 0.0 ✗ 1.59
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 4.15
openai/o3-mini ✓ 0 1.0 ✓ 1.62
openai/o1-mini ✓ 0 1.0 ✓ 2.25
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.73
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.87
ollama/cogito:8b ✗ 0 0.0 ✗ 2.94
ollama/cogito:3b ✗ 0 0.0 ✗ 1.85
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.56
ollama/llama3.2:latest ✗ 0 1.0 ✗ 1.42
21
Count Rs in long words

Problem Text:

In the sentence "The remarkable researcher presented revolutionary results," count the number of letter R's in words that have MORE THAN 10 letters.

Grading Rules:

must correctly identify that revolutionary is the only word with more than 10 letters or say the answer is 2

Correct Answer:

2
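The threshold is strict: "remarkable" and "researcher" have exactly 10 letters and so do not qualify, which leaves "revolutionary" (13 letters). A Python sketch of that check, with a hypothetical function name and an assumed `min_len` parameter:

```python
import string


def count_r_in_long_words(sentence: str, min_len: int = 10) -> int:
    """Count R's in words with strictly more than min_len letters."""
    # Strip punctuation first so a trailing comma is not counted as a letter.
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]
    return sum(w.count("r") for w in words if len(w) > min_len)


print(count_r_in_long_words(
    "The remarkable researcher presented revolutionary results,"
))  # 2
```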

Model Results:

Model Passed Shot # Score Shot Passed Time (s)
ollama/gemma3:12b ✗ 0 0.0 ✗ 1.40
ollama/gemma2:latest ✗ 0 0.0 ✗ 2.24
ollama/phi4:latest ✗ 0 0.0 ✗ 1.50
ollama/phi3.5:latest ✗ 0 0.0 ✗ 1.86
openai/gpt-4.1 ✓ 0 1.0 ✓ 1.40
openai/gpt-4.1-mini ✓ 0 1.0 ✓ 1.65
openai/gpt-4.1-nano ✗ 0 0.0 ✗ 1.68
openai/gpt-4o ✗ 0 0.0 ✗ 2.50
openai/gpt-4o-mini ✓ 0 1.0 ✓ 1.55
openai/gpt-3.5-turbo ✗ 0 0.0 ✗ 1.50
openai/o3-mini ✓ 0 1.0 ✓ 1.52
openai/o1-mini ✓ 0 1.0 ✓ 2.92
anthropic/claude-3-7-sonnet-latest ✓ 0 1.0 ✓ 1.41
anthropic/claude-3-5-sonnet-latest ✗ 0 0.0 ✗ 1.48
ollama/cogito:8b ✗ 0 0.0 ✗ 2.52
ollama/cogito:3b ✗ 0 0.0 ✗ 1.67
ollama/qwen2.5-coder:3b ✗ 0 0.0 ✗ 1.16
ollama/llama3.2:latest ✗ 0 0.0 ✗ 1.57
Generated with 🛡️ Guardsail 0.2.1
For more details visit magicpilllabs.com