Model	Pass Rate	Token Usage	Cost
openai/o3-mini	21/21 (100.0%)	15,184 (in: 13,541, out: 1,643)	$0.1237 (Solving: $0.0735, Grading: $0.0503)
openai/o1-mini	21/21 (100.0%)	15,557 (in: 13,849, out: 1,708)	$0.2269 (Solving: $0.1752, Grading: $0.0517)
openai/gpt-4.1	18/21 (85.7%)	15,730 (in: 13,937, out: 1,793)	$0.0890 (Solving: $0.0363, Grading: $0.0528)
openai/gpt-4.1-mini	18/21 (85.7%)	14,440 (in: 12,640, out: 1,800)	$0.0548 (Solving: $0.0052, Grading: $0.0496)
anthropic/claude-3-7-sonnet-latest	17/21 (81.0%)	14,201 (in: 12,402, out: 1,799)	$0.0981 (Solving: $0.0491, Grading: $0.0490)
openai/gpt-4o-mini	16/21 (76.2%)	13,894 (in: 12,108, out: 1,786)	$0.0498 (Solving: $0.0016, Grading: $0.0481)
openai/gpt-4o	14/21 (66.7%)	14,158 (in: 12,144, out: 2,014)	$0.0779 (Solving: $0.0274, Grading: $0.0505)
openai/gpt-4.1-nano	12/21 (57.1%)	14,132 (in: 12,289, out: 1,843)	$0.0503 (Solving: $0.0012, Grading: $0.0492)
anthropic/claude-3-5-sonnet-latest	11/21 (52.4%)	13,251 (in: 11,396, out: 1,855)	$0.0795 (Solving: $0.0324, Grading: $0.0470)
ollama/gemma2:latest	9/21 (42.9%)	13,594 (in: 11,604, out: 1,990)	$0.0489 (Solving: $0.0000, Grading: $0.0489)
ollama/phi4:latest	8/21 (38.1%)	15,630 (in: 13,355, out: 2,275)	$0.0561 (Solving: $0.0000, Grading: $0.0561)
ollama/llama3.2:latest	8/21 (38.1%)	13,755 (in: 11,642, out: 2,113)	$0.0502 (Solving: $0.0000, Grading: $0.0502)
ollama/gemma3:12b	7/21 (33.3%)	13,849 (in: 11,901, out: 1,948)	$0.0492 (Solving: $0.0000, Grading: $0.0492)
openai/gpt-3.5-turbo	5/21 (23.8%)	12,474 (in: 10,516, out: 1,958)	$0.0477 (Solving: $0.0018, Grading: $0.0459)
ollama/cogito:8b	4/21 (19.0%)	13,644 (in: 11,444, out: 2,200)	$0.0506 (Solving: $0.0000, Grading: $0.0506)
ollama/phi3.5:latest	3/21 (14.3%)	16,739 (in: 14,320, out: 2,419)	$0.0600 (Solving: $0.0000, Grading: $0.0600)
ollama/qwen2.5-coder:3b	3/21 (14.3%)	14,182 (in: 12,056, out: 2,126)	$0.0514 (Solving: $0.0000, Grading: $0.0514)
ollama/cogito:3b	2/21 (9.5%)	14,151 (in: 12,046, out: 2,105)	$0.0512 (Solving: $0.0000, Grading: $0.0512)
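The Pass Rate and Token Usage columns are derived values. As a minimal sketch (the function names are hypothetical and not part of the benchmark's tooling), their formatting can be reproduced like this:

```python
def format_pass_rate(passed: int, total: int) -> str:
    """Render a pass count the way the table shows it, e.g. 18/21 (85.7%)."""
    return f"{passed}/{total} ({100 * passed / total:.1f}%)"


def format_tokens(tokens_in: int, tokens_out: int) -> str:
    """Render total token usage with its input/output split."""
    total = tokens_in + tokens_out
    return f"{total:,} (in: {tokens_in:,}, out: {tokens_out:,})"


print(format_pass_rate(18, 21))       # 18/21 (85.7%)
print(format_tokens(13_541, 1_643))   # 15,184 (in: 13,541, out: 1,643)
```

The Cost column is the sum of the Solving and Grading figures. Each figure appears to be rounded to four decimals independently, so the parts can differ from the listed total by $0.0001 (for example 0.0735 + 0.0503 against the listed 0.1237).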

Rs - Strawberry

Problem Text:

how many letter R's are in the word strawberry?

Grading Rules:

must state the correct answer

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	0.98
		1	1.0	✓	0.93
		2	1.0	✓	1.30
		3	1.0	✓	1.18
		4	1.0	✓	0.93

ollama/gemma2:latest	✗	0	0.0	✗	1.56
		1	0.0	✗	1.64
		2	0.0	✗	1.49
		3	0.0	✗	1.49
		4	0.0	✗	1.11

ollama/phi4:latest	✗	0	0.0	✗	1.29
		1	0.0	✗	1.37
		2	0.0	✗	1.53
		3	0.0	✗	1.44
		4	0.0	✗	1.43

ollama/phi3.5:latest	✓	0	1.0	✓	1.56
		1	0.0	✗	2.37
		2	0.0	✗	1.50
		3	1.0	✓	1.20
		4	1.0	✓	2.49

openai/gpt-4.1	✓	0	1.0	✓	1.48
		1	1.0	✓	1.67
		2	1.0	✓	1.27
		3	1.0	✓	1.23
		4	1.0	✓	1.57

openai/gpt-4.1-mini	✗	0	0.0	✗	1.32
		1	0.0	✗	1.53
		2	0.0	✗	1.49
		3	0.0	✗	6.99
		4	0.0	✗	1.99

openai/gpt-4.1-nano	✗	0	0.0	✗	1.22
		1	0.0	✗	1.47
		2	0.0	✗	1.17
		3	0.0	✗	1.26
		4	0.0	✗	3.01

openai/gpt-4o	✓	0	0.0	✗	1.65
		1	1.0	✓	1.37
		2	1.0	✓	1.21
		3	0.0	✗	1.16
		4	0.0	✗	1.22

openai/gpt-4o-mini	✓	0	1.0	✓	2.02
		1	1.0	✓	1.32
		2	0.0	✗	1.73
		3	1.0	✓	1.00
		4	1.0	✓	1.11

openai/gpt-3.5-turbo	✓	0	0.0	✗	1.14
		1	0.0	✗	1.64
		2	0.0	✗	1.62
		3	1.0	✓	1.35
		4	1.0	✓	1.47

openai/o3-mini	✓	0	1.0	✓	1.11
		1	1.0	✓	0.94
		2	1.0	✓	1.79
		3	1.0	✓	1.16
		4	1.0	✓	1.16

openai/o1-mini	✓	0	1.0	✓	1.05
		1	1.0	✓	1.26
		2	0.0	✗	1.24
		3	1.0	✓	1.31
		4	1.0	✓	1.49

anthropic/claude-3-7-sonnet-latest	✗	0	0.0	✗	1.41
		1	0.0	✗	1.28
		2	0.0	✗	1.64
		3	0.0	✗	1.66
		4	0.0	✗	1.18

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.39
		1	0.0	✗	1.53
		2	0.0	✗	1.68
		3	0.0	✗	2.26
		4	0.0	✗	1.37

ollama/cogito:8b	✓	0	0.0	✗	2.58
		1	1.0	✓	1.52
		2	0.0	✗	1.56
		3	0.0	✗	1.15
		4	1.0	✓	1.08

ollama/cogito:3b	✗	0	0.0	✗	1.79
		1	0.0	✗	1.43
		2	0.0	✗	1.05
		3	0.0	✗	1.68
		4	0.0	✗	1.84

ollama/qwen2.5-coder:3b	✓	0	1.0	✓	1.48
		1	0.0	✗	1.88
		2	0.0	✗	1.03
		3	0.0	✗	1.69
		4	0.0	✗	1.86

ollama/llama3.2:latest	✓	0	1.0	✓	1.10
		1	0.0	✗	1.37
		2	0.0	✗	1.21
		3	0.0	✗	1.38
		4	0.0	✗	1.25
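In the table above, the Passed column is ✓ for every model with at least one passing shot (for example openai/gpt-4o with 2 of 5) and ✗ only when all five shots fail. Assuming that is the aggregation rule (this section does not state it), a minimal sketch with hypothetical names, together with a plain letter count for the prompt:

```python
def count_letter(text: str, letter: str = "r") -> int:
    """Count occurrences of a letter, ignoring case."""
    return text.lower().count(letter.lower())


def passed_overall(shot_results: list[bool]) -> bool:
    """Assumed rule: a model passes a problem if any single shot passes."""
    return any(shot_results)


print(count_letter("strawberry"))                         # 3
print(passed_overall([False, True, True, False, False]))  # True (the gpt-4o row)
print(passed_overall([False] * 5))                        # False
```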

Rs - Complex

Problem Text:

how many letter R's are in the sentence "ravens roost on a berry colored rangerover"?

Grading Rules:

must state the correct answer

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.21

ollama/gemma2:latest	✗	0	0.0	✗	1.37

ollama/phi4:latest	✗	0	0.0	✗	2.13

ollama/phi3.5:latest	✗	0	0.0	✗	1.40

openai/gpt-4.1	✗	0	0.0	✗	1.66

openai/gpt-4.1-mini	✗	0	0.0	✗	1.50

openai/gpt-4.1-nano	✗	0	0.0	✗	1.54

openai/gpt-4o	✓	0	1.0	✓	1.10

openai/gpt-4o-mini	✗	0	0.0	✗	0.96

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.32

openai/o3-mini	✓	0	1.0	✓	2.08

openai/o1-mini	✓	0	1.0	✓	1.75

anthropic/claude-3-7-sonnet-latest	✗	0	0.0	✗	1.32

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.23

ollama/cogito:8b	✗	0	0.0	✗	1.71

ollama/cogito:3b	✗	0	0.0	✗	1.40

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.51

ollama/llama3.2:latest	✓	0	1.0	✓	1.11

Count Rs - Ignore Last Word

Problem Text:

Ignoring the last word, how many letter R's are in the phrase "rural road runner" ignoring vowels?

Grading Rules:

must correctly ignore vowels and count R's

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	1.36

ollama/gemma2:latest	✓	0	1.0	✓	1.78

ollama/phi4:latest	✗	0	0.0	✗	2.65

ollama/phi3.5:latest	✗	0	0.0	✗	2.43

openai/gpt-4.1	✓	0	1.0	✓	5.71

openai/gpt-4.1-mini	✓	0	1.0	✓	1.41

openai/gpt-4.1-nano	✓	0	1.0	✓	1.64

openai/gpt-4o	✗	0	0.0	✗	2.96

openai/gpt-4o-mini	✗	0	0.0	✗	2.42

openai/gpt-3.5-turbo	✓	0	1.0	✓	1.28

openai/o3-mini	✓	0	1.0	✓	1.62

openai/o1-mini	✓	0	1.0	✓	1.55

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.91

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.10

ollama/cogito:8b	✓	0	1.0	✓	1.14

ollama/cogito:3b	✗	0	0.0	✗	1.39

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.50

ollama/llama3.2:latest	✗	0	0.0	✗	6.12

Count Rs - Ignore first word

Problem Text:

Ignoring the first word, how many letter R's are in the phrase "red roses rarely ripen right"?

Grading Rules:

must correctly ignore first word

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.29

ollama/gemma2:latest	✓	0	1.0	✓	1.76

ollama/phi4:latest	✓	0	1.0	✓	1.67

ollama/phi3.5:latest	✗	0	0.0	✗	4.97

openai/gpt-4.1	✓	0	1.0	✓	1.18

openai/gpt-4.1-mini	✓	0	1.0	✓	1.35

openai/gpt-4.1-nano	✓	0	1.0	✓	1.63

openai/gpt-4o	✗	0	0.0	✗	2.10

openai/gpt-4o-mini	✓	0	1.0	✓	1.63

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.38

openai/o3-mini	✓	0	1.0	✓	1.50

openai/o1-mini	✓	0	1.0	✓	1.51

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.57

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.28

ollama/cogito:8b	✗	0	0.0	✗	1.78

ollama/cogito:3b	✗	0	0.0	✗	1.61

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	0.93

ollama/llama3.2:latest	✗	0	0.0	✗	1.83
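For reference, the count this problem asks for can be checked mechanically. The sketch below is one straightforward reading of the prompt (split on whitespace, drop the first word, count case-insensitively); the function name is hypothetical, and the benchmark's recorded answer is not shown in this section.

```python
def count_r_skipping_first_word(phrase: str) -> int:
    """Count the letter R in every word except the first."""
    words = phrase.lower().split()
    return sum(word.count("r") for word in words[1:])


print(count_r_skipping_first_word("red roses rarely ripen right"))  # 5
```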

Count Rs - Only first and last word

Problem Text:

Counting letters in only the first and last words, how many R's are in "royal ravens rarely roam"?

Grading Rules:

must correctly count only first and last words

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	1.51

ollama/gemma2:latest	✓	0	1.0	✓	1.50

ollama/phi4:latest	✗	0	0.0	✗	2.18

ollama/phi3.5:latest	✗	0	0.0	✗	2.45

openai/gpt-4.1	✓	0	1.0	✓	3.59

openai/gpt-4.1-mini	✓	0	1.0	✓	1.99

openai/gpt-4.1-nano	✗	0	0.0	✗	1.65

openai/gpt-4o	✓	0	1.0	✓	1.69

openai/gpt-4o-mini	✓	0	1.0	✓	1.54

openai/gpt-3.5-turbo	✓	0	1.0	✓	1.47

openai/o3-mini	✓	0	1.0	✓	1.64

openai/o1-mini	✓	0	1.0	✓	1.10

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.69

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.49

ollama/cogito:8b	✗	0	0.0	✗	1.88

ollama/cogito:3b	✓	0	1.0	✓	1.65

ollama/qwen2.5-coder:3b	✓	0	1.0	✓	1.24

ollama/llama3.2:latest	✓	0	1.0	✓	2.08
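A possible way to verify this prompt by hand is sketched below. It assumes words are separated by whitespace and uses a hypothetical helper name; a set of indices keeps a one-word phrase from being counted twice.

```python
def count_r_first_and_last(phrase: str) -> int:
    """Count the letter R in the first and last words only."""
    words = phrase.lower().split()
    if not words:
        return 0
    picked = {0, len(words) - 1}  # a set, so a single word is counted once
    return sum(words[i].count("r") for i in picked)


print(count_r_first_and_last("royal ravens rarely roam"))  # 2
```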

Count Rs - Capital letters only

Problem Text:

How many uppercase letter R's are in "Rural Roaming Rancho"?

Grading Rules:

must count only uppercase letters

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.36

ollama/gemma2:latest	✓	0	1.0	✓	1.63

ollama/phi4:latest	✓	0	1.0	✓	1.48

ollama/phi3.5:latest	✗	0	0.0	✗	3.02

openai/gpt-4.1	✓	0	1.0	✓	1.81

openai/gpt-4.1-mini	✓	0	1.0	✓	1.10

openai/gpt-4.1-nano	✗	0	0.0	✗	1.56

openai/gpt-4o	✗	0	0.0	✗	1.55

openai/gpt-4o-mini	✓	0	1.0	✓	1.39

openai/gpt-3.5-turbo	✓	0	1.0	✓	0.95

openai/o3-mini	✓	0	1.0	✓	6.04

openai/o1-mini	✓	0	1.0	✓	1.24

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.34

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.00

ollama/cogito:8b	✓	0	1.0	✓	1.39

ollama/cogito:3b	✗	0	0.0	✗	1.38

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.02

ollama/llama3.2:latest	✗	0	0.0	✗	6.62

Count Rs - Exclude repeated words

Problem Text:

Excluding duplicate words, how many R's are in "rare rare rabbits run run rapidly"?

Grading Rules:

must count only unique words

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	3.28

ollama/gemma2:latest	✗	0	0.0	✗	4.30

ollama/phi4:latest	✓	0	1.0	✓	1.94

ollama/phi3.5:latest	✗	0	0.0	✗	2.75

openai/gpt-4.1	✓	0	1.0	✓	1.32

openai/gpt-4.1-mini	✓	0	1.0	✓	1.58

openai/gpt-4.1-nano	✓	0	1.0	✓	1.98

openai/gpt-4o	✓	0	1.0	✓	1.44

openai/gpt-4o-mini	✓	0	1.0	✓	1.25

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.99

openai/o3-mini	✓	0	1.0	✓	1.26

openai/o1-mini	✓	0	1.0	✓	2.20

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	5.66

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	2.52

ollama/cogito:8b	✗	0	0.0	✗	1.99

ollama/cogito:3b	✗	0	0.0	✗	2.84

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.71

ollama/llama3.2:latest	✗	0	0.0	✗	1.60
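The wording "excluding duplicate words" admits more than one reading, and this section does not show which one the grader expects. The sketch below (hypothetical function names) computes two plausible interpretations so the difference is explicit.

```python
from collections import Counter


def count_r_deduplicated(phrase: str) -> int:
    """Reading 1: keep one copy of each word, then count R's."""
    unique_words = dict.fromkeys(phrase.lower().split())
    return sum(word.count("r") for word in unique_words)


def count_r_non_repeated_only(phrase: str) -> int:
    """Reading 2: drop every word that occurs more than once, then count R's."""
    counts = Counter(phrase.lower().split())
    return sum(word.count("r") for word, n in counts.items() if n == 1)


phrase = "rare rare rabbits run run rapidly"
print(count_r_deduplicated(phrase))       # 5
print(count_r_non_repeated_only(phrase))  # 2
```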

Count Rs - Ignore letters after S

Problem Text:

Ignoring letters in words after the first S appears in the word, how many R's are in "restored servers start readily"?

Grading Rules:

must correctly ignore letters after S

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.92

ollama/gemma2:latest	✓	0	1.0	✓	2.21

ollama/phi4:latest	✓	0	1.0	✓	1.30

ollama/phi3.5:latest	✗	0	0.0	✗	4.79

openai/gpt-4.1	✓	0	1.0	✓	1.74

openai/gpt-4.1-mini	✓	0	1.0	✓	2.00

openai/gpt-4.1-nano	✓	0	1.0	✓	2.12

openai/gpt-4o	✓	0	1.0	✓	1.71

openai/gpt-4o-mini	✓	0	1.0	✓	1.43

openai/gpt-3.5-turbo	✗	0	0.0	✗	2.18

openai/o3-mini	✓	0	1.0	✓	1.61

openai/o1-mini	✓	0	1.0	✓	1.75

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.27

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	2.48

ollama/cogito:8b	✗	0	0.0	✗	2.72

ollama/cogito:3b	✗	0	0.0	✗	1.96

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.96

ollama/llama3.2:latest	✓	0	1.0	✓	1.61
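One way to make the "ignore letters after the first S" rule concrete is to truncate each word at its first S before counting. This is a sketch under that assumption, with a hypothetical function name; the benchmark's recorded answer is not shown in this section.

```python
def count_r_before_first_s(phrase: str) -> int:
    """Count R's in each word, looking only at letters before its first S."""
    total = 0
    for word in phrase.lower().split():
        # partition returns the whole word as the head when there is no "s"
        head, _, _ = word.partition("s")
        total += head.count("r")
    return total


print(count_r_before_first_s("restored servers start readily"))  # 2
```

Here "restored" contributes the R before its S, "servers" and "start" begin with S and contribute nothing, and "readily" has no S and is counted in full.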

Count Rs - Only middle word

Problem Text:

Only looking at the middle word in the phrase, how many R's are there in "rapid rain rarely"?

Grading Rules:

must correctly identify middle word

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.64

ollama/gemma2:latest	✓	0	1.0	✓	1.35

ollama/phi4:latest	✓	0	1.0	✓	1.28

ollama/phi3.5:latest	✗	0	0.0	✗	1.70

openai/gpt-4.1	✓	0	1.0	✓	1.31

openai/gpt-4.1-mini	✓	0	1.0	✓	1.26

openai/gpt-4.1-nano	✓	0	1.0	✓	1.15

openai/gpt-4o	✓	0	1.0	✓	1.23

openai/gpt-4o-mini	✗	0	0.0	✗	1.21

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.21

openai/o3-mini	✓	0	1.0	✓	1.26

openai/o1-mini	✓	0	1.0	✓	1.33

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.49

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.36

ollama/cogito:8b	✗	0	0.0	✗	1.70

ollama/cogito:3b	✗	0	0.0	✗	1.46

ollama/qwen2.5-coder:3b	✓	0	1.0	✓	0.97

ollama/llama3.2:latest	✓	0	1.0	✓	1.30

Count Rs next to number characters

Problem Text:

Count all the letter R's in the following phrase where the R is next to a number character: "1R1RRRPPPR5"?

Grading Rules:

must correctly ignore numbers

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	3.81

ollama/gemma2:latest	✗	0	0.0	✗	1.98

ollama/phi4:latest	✓	0	1.0	✓	2.33

ollama/phi3.5:latest	✓	0	1.0	✓	1.64

openai/gpt-4.1	✓	0	1.0	✓	1.33

openai/gpt-4.1-mini	✓	0	1.0	✓	1.55

openai/gpt-4.1-nano	✓	0	1.0	✓	1.36

openai/gpt-4o	✓	0	1.0	✓	1.32

openai/gpt-4o-mini	✓	0	1.0	✓	1.41

openai/gpt-3.5-turbo	✗	0	0.0	✗	2.82

openai/o3-mini	✓	0	1.0	✓	0.99

openai/o1-mini	✓	0	1.0	✓	1.45

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	2.44

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.91

ollama/cogito:8b	✗	0	0.0	✗	1.94

ollama/cogito:3b	✗	0	0.0	✗	2.05

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	2.03

ollama/llama3.2:latest	✓	0	1.0	✓	1.27
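The adjacency condition in this prompt can be checked character by character. The sketch below assumes "next to" means the immediately preceding or following character is a digit; the function name is hypothetical.

```python
def count_r_next_to_digit(text: str) -> int:
    """Count R's whose left or right neighbour is a digit."""
    total = 0
    for i, ch in enumerate(text):
        if ch.lower() != "r":
            continue
        left = text[i - 1] if i > 0 else ""
        right = text[i + 1] if i + 1 < len(text) else ""
        if left.isdigit() or right.isdigit():  # "".isdigit() is False
            total += 1
    return total


print(count_r_next_to_digit("1R1RRRPPPR5"))  # 3
```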

Count Rs in the Longest Word

Problem Text:

Count the letter R's in the longest word in the phrase: "Rapid runner races"?

Grading Rules:

must correctly count the letters

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	7.82

ollama/gemma2:latest	✓	0	1.0	✓	1.49

ollama/phi4:latest	✓	0	1.0	✓	2.19

ollama/phi3.5:latest	✓	0	1.0	✓	1.45

openai/gpt-4.1	✓	0	1.0	✓	1.35

openai/gpt-4.1-mini	✓	0	1.0	✓	1.13

openai/gpt-4.1-nano	✓	0	1.0	✓	1.45

openai/gpt-4o	✗	0	0.0	✗	1.33

openai/gpt-4o-mini	✓	0	1.0	✓	1.63

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.37

openai/o3-mini	✓	0	1.0	✓	1.21

openai/o1-mini	✓	0	1.0	✓	1.12

anthropic/claude-3-7-sonnet-latest	✗	0	0.0	✗	5.30

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.31

ollama/cogito:8b	✗	0	0.0	✗	2.28

ollama/cogito:3b	✗	0	0.0	✗	2.71

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	2.37

ollama/llama3.2:latest	✓	0	1.0	✓	1.65
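As a quick cross-check of this prompt, the longest word can be selected with `max` and then counted. This is a sketch with a hypothetical function name; note that `max` returns the first word on a tie, which is an assumption the prompt does not address.

```python
def count_r_in_longest_word(phrase: str) -> int:
    """Count R's in the longest whitespace-separated word (first one on ties)."""
    words = phrase.lower().split()
    if not words:
        return 0
    longest = max(words, key=len)
    return longest.count("r")


print(count_r_in_longest_word("Rapid runner races"))  # 2
```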

Count Rs - No R's Exist

Problem Text:

Count the letter R's in the phrase: "This Sentance is a Fancy Message"?

Grading Rules:

must correctly count the letters

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	8.48

ollama/gemma2:latest	✗	0	0.0	✗	1.65

ollama/phi4:latest	✗	0	0.0	✗	1.90

ollama/phi3.5:latest	✗	0	0.0	✗	2.06

openai/gpt-4.1	✓	0	1.0	✓	1.19

openai/gpt-4.1-mini	✓	0	1.0	✓	0.95

openai/gpt-4.1-nano	✗	0	0.0	✗	1.04

openai/gpt-4o	✓	0	1.0	✓	1.26

openai/gpt-4o-mini	✓	0	1.0	✓	1.65

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.48

openai/o3-mini	✓	0	1.0	✓	0.92

openai/o1-mini	✓	0	1.0	✓	1.24

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.39

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.48

ollama/cogito:8b	✗	0	0.0	✗	1.74

ollama/cogito:3b	✗	0	0.0	✗	1.14

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.64

ollama/llama3.2:latest	✗	0	0.0	✗	1.59

Count Rs - No R's Exist

Problem Text:

Count the letter R's in the phrase: "This Sentance is a Fancy Message"?

Grading Rules:

must correctly count the letters

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	2.10

ollama/gemma2:latest	✗	0	0.0	✗	2.44

ollama/phi4:latest	✗	0	0.0	✗	2.02

ollama/phi3.5:latest	✗	0	0.0	✗	2.46

openai/gpt-4.1	✓	0	1.0	✓	1.40

openai/gpt-4.1-mini	✓	0	1.0	✓	1.30

openai/gpt-4.1-nano	✗	0	0.0	✗	5.12

openai/gpt-4o	✓	0	1.0	✓	1.35

openai/gpt-4o-mini	✓	0	1.0	✓	0.94

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.73

openai/o3-mini	✓	0	1.0	✓	1.07

openai/o1-mini	✓	0	1.0	✓	1.26

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.31

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.25

ollama/cogito:8b	✗	0	0.0	✗	1.11

ollama/cogito:3b	✗	0	0.0	✗	1.22

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.25

ollama/llama3.2:latest	✗	0	0.0	✗	1.29

Count Rs followed by vowels

Problem Text:

How many letter R's in the sentence "Ravens roost on a berry colored rangerover" are immediately followed by a vowel (a, e, i, o, u)?

Grading Rules:

must correctly count only the R's followed by vowels

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	2.82

ollama/gemma2:latest	✗	0	0.0	✗	2.89

ollama/phi4:latest	✗	0	0.0	✗	2.34

ollama/phi3.5:latest	✗	0	0.0	✗	3.20

openai/gpt-4.1	✗	0	0.0	✗	2.12

openai/gpt-4.1-mini	✗	0	0.0	✗	3.70

openai/gpt-4.1-nano	✓	0	1.0	✓	1.72

openai/gpt-4o	✓	0	1.0	✓	1.86

openai/gpt-4o-mini	✓	0	1.0	✓	1.90

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.56

openai/o3-mini	✓	0	1.0	✓	1.56

openai/o1-mini	✓	0	1.0	✓	2.35

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.77

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	2.15

ollama/cogito:8b	✗	0	0.0	✗	3.26

ollama/cogito:3b	✗	0	0.0	✗	1.71

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.78

ollama/llama3.2:latest	✗	0	0.0	✗	3.30
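The "immediately followed by a vowel" condition maps naturally onto a regular-expression lookahead. The sketch below uses the vowel set given in the prompt (a, e, i, o, u) and a hypothetical function name; the benchmark's recorded answer is not shown in this section.

```python
import re


def count_r_followed_by_vowel(sentence: str) -> int:
    """Count R's whose next character is one of a, e, i, o, u."""
    return len(re.findall(r"r(?=[aeiou])", sentence, flags=re.IGNORECASE))


print(count_r_followed_by_vowel("Ravens roost on a berry colored rangerover"))  # 5
```

The matches are the R's in "Ravens", "roost", and "colored", plus the first two in "rangerover"; both R's in "berry" and the final R of "rangerover" are excluded.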

Count internal Rs

Problem Text:

How many letter R's in the word "rearrange" are not at the beginning or the end of the word?

Grading Rules:

must correctly count only the R's that are not the first or last letter

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	2.66

ollama/gemma2:latest	✗	0	0.0	✗	2.09

ollama/phi4:latest	✗	0	0.0	✗	1.97

ollama/phi3.5:latest	✗	0	0.0	✗	2.45

openai/gpt-4.1	✓	0	1.0	✓	2.26

openai/gpt-4.1-mini	✓	0	1.0	✓	2.24

openai/gpt-4.1-nano	✗	0	0.0	✗	2.86

openai/gpt-4o	✗	0	0.0	✗	2.37

openai/gpt-4o-mini	✓	0	1.0	✓	1.75

openai/gpt-3.5-turbo	✓	0	1.0	✓	1.37

openai/o3-mini	✓	0	1.0	✓	1.13

openai/o1-mini	✓	0	1.0	✓	1.13

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.16

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	2.00

ollama/cogito:8b	✗	0	0.0	✗	1.68

ollama/cogito:3b	✓	0	1.0	✓	1.45

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.64

ollama/llama3.2:latest	✗	0	0.0	✗	1.52
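Excluding the first and last letters is a simple slice. A minimal sketch with a hypothetical function name:

```python
def count_internal_r(word: str) -> int:
    """Count R's that are neither the first nor the last letter of the word."""
    return word.lower()[1:-1].count("r")


print(count_internal_r("rearrange"))  # 2
```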

Count Rs in every second word

Problem Text:

In the sentence "Red roses are rarely red," count the number of letter R's in every second word, starting from the first word.

Grading Rules:

must correctly identify every second word and count the R's in those

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	2.85

ollama/gemma2:latest	✗	0	0.0	✗	1.68

ollama/phi4:latest	✗	0	0.0	✗	2.06

ollama/phi3.5:latest	✗	0	0.0	✗	2.41

openai/gpt-4.1	✗	0	0.0	✗	1.75

openai/gpt-4.1-mini	✓	0	1.0	✓	1.58

openai/gpt-4.1-nano	✗	0	0.0	✗	1.77

openai/gpt-4o	✗	0	0.0	✗	4.07

openai/gpt-4o-mini	✓	0	1.0	✓	1.44

openai/gpt-3.5-turbo	✗	0	0.0	✗	3.24

openai/o3-mini	✓	0	1.0	✓	1.70

openai/o1-mini	✓	0	1.0	✓	1.64

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	2.21

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.02

ollama/cogito:8b	✗	0	0.0	✗	1.80

ollama/cogito:3b	✗	0	0.0	✗	1.73

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	2.15

ollama/llama3.2:latest	✓	0	1.0	✓	2.04
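"Every second word, starting from the first" can be read as the words at positions 1, 3, 5, and so on. Under that assumption, a sketch with a hypothetical function name:

```python
import re


def count_r_every_second_word(sentence: str) -> int:
    """Count R's in the 1st, 3rd, 5th, ... words of the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())  # drops punctuation
    return sum(word.count("r") for word in words[::2])


print(count_r_every_second_word("Red roses are rarely red,"))  # 3
```

The selected words are "Red", "are", and "red", each containing one R.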

Count Rs in words with S

Problem Text:

In the sentence "Ravens and sparrows share resources," count the number of letter R's in words that also contain the letter S.

Grading Rules:

must correctly identify words containing S and count the R's in those words

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	2.54

ollama/gemma2:latest	✗	0	0.0	✗	2.57

ollama/phi4:latest	✗	0	0.0	✗	1.51

ollama/phi3.5:latest	✗	0	0.0	✗	1.98

openai/gpt-4.1	✓	0	1.0	✓	1.63

openai/gpt-4.1-mini	✓	0	1.0	✓	2.51

openai/gpt-4.1-nano	✓	0	1.0	✓	1.43

openai/gpt-4o	✓	0	1.0	✓	2.29

openai/gpt-4o-mini	✓	0	1.0	✓	1.64

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.71

openai/o3-mini	✓	0	1.0	✓	1.52

openai/o1-mini	✓	0	1.0	✓	2.83

anthropic/claude-3-7-sonnet-latest	✗	0	0.0	✗	2.37

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	2.35

ollama/cogito:8b	✗	0	0.0	✗	2.90

ollama/cogito:3b	✗	0	0.0	✗	2.16

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.77

ollama/llama3.2:latest	✗	0	0.0	✗	1.83
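This prompt combines a filter (the word contains S) with a count. A minimal case-insensitive sketch, using a hypothetical function name; the benchmark's recorded answer is not shown in this section.

```python
import re


def count_r_in_words_with_s(sentence: str) -> int:
    """Count R's in words that contain at least one S."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(word.count("r") for word in words if "s" in word)


print(count_r_in_words_with_s("Ravens and sparrows share resources,"))  # 6
```

Under this reading "Ravens" (1), "sparrows" (2), "share" (1), and "resources" (2) qualify, and "and" does not.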

Count Rs in repeated words

Problem Text:

In the sentence "Rare rabbits run rapidly, but rare rabbits rest rarely," count the number of letter R's in unique words that appear more than once.

Grading Rules:

must identify unique words that appear more than once and count the R's in those words

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	1.92

ollama/gemma2:latest	✓	0	1.0	✓	4.92

ollama/phi4:latest	✗	0	0.0	✗	4.62

ollama/phi3.5:latest	✗	0	0.0	✗	1.86

openai/gpt-4.1	✓	0	1.0	✓	1.90

openai/gpt-4.1-mini	✓	0	1.0	✓	1.63

openai/gpt-4.1-nano	✓	0	1.0	✓	1.57

openai/gpt-4o	✓	0	1.0	✓	1.28

openai/gpt-4o-mini	✗	0	0.0	✗	1.52

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.21

openai/o3-mini	✓	0	1.0	✓	1.43

openai/o1-mini	✓	0	1.0	✓	2.42

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.47

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	3.12

ollama/cogito:8b	✗	0	0.0	✗	1.98

ollama/cogito:3b	✗	0	0.0	✗	1.72

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	2.10

ollama/llama3.2:latest	✗	0	0.0	✗	2.63
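One reading of this prompt is: find the distinct words that occur more than once, then count the R's in each such word once. The sketch below implements that reading with a hypothetical function name and case-insensitive matching; counting every occurrence instead would double the result.

```python
import re
from collections import Counter


def count_r_in_repeated_words(sentence: str) -> int:
    """Count R's once per distinct word that appears more than once."""
    words = re.findall(r"[a-z]+", sentence.lower())
    counts = Counter(words)
    return sum(word.count("r") for word, n in counts.items() if n > 1)


sentence = "Rare rabbits run rapidly, but rare rabbits rest rarely,"
print(count_r_in_repeated_words(sentence))  # 3
```

The repeated words are "rare" (2 R's) and "rabbits" (1 R).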

Count Rs in even-length words

Problem Text:

In the sentence "river rowers raft happily" count the number of letter R's in words that have an even number of letters.

Grading Rules:

must correctly identify words with even number of letters and count the R's in those

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✓	0	1.0	✓	1.42

ollama/gemma2:latest	✓	0	1.0	✓	1.43

ollama/phi4:latest	✓	0	1.0	✓	1.35

ollama/phi3.5:latest	✗	0	0.0	✗	2.98

openai/gpt-4.1	✓	0	1.0	✓	2.05

openai/gpt-4.1-mini	✓	0	1.0	✓	2.62

openai/gpt-4.1-nano	✓	0	1.0	✓	1.07

openai/gpt-4o	✓	0	1.0	✓	1.69

openai/gpt-4o-mini	✓	0	1.0	✓	1.60

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.54

openai/o3-mini	✓	0	1.0	✓	1.21

openai/o1-mini	✓	0	1.0	✓	1.73

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.59

anthropic/claude-3-5-sonnet-latest	✓	0	1.0	✓	1.19

ollama/cogito:8b	✓	0	1.0	✓	2.59

ollama/cogito:3b	✗	0	0.0	✗	1.43

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	2.65

ollama/llama3.2:latest	✗	0	0.0	✗	2.76
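The even-length filter can be checked in a few lines. A sketch with a hypothetical function name:

```python
def count_r_in_even_length_words(sentence: str) -> int:
    """Count R's in words whose length is an even number of letters."""
    words = sentence.lower().split()
    return sum(word.count("r") for word in words if len(word) % 2 == 0)


print(count_r_in_even_length_words("river rowers raft happily"))  # 3
```

Only "rowers" (6 letters, 2 R's) and "raft" (4 letters, 1 R) have even length; "river" and "happily" are skipped.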

Count Rs in words starting with a vowel

Problem Text:

In the sentence "Eagles and ostriches are birds," count the number of letter R's in words that start with a vowel.

Grading Rules:

must correctly identify words starting with a vowel and count the R's in those

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	2.23

ollama/gemma2:latest	✗	0	0.0	✗	2.45

ollama/phi4:latest	✗	0	0.0	✗	2.06

ollama/phi3.5:latest	✗	0	0.0	✗	5.08

openai/gpt-4.1	✓	0	1.0	✓	1.56

openai/gpt-4.1-mini	✓	0	1.0	✓	1.64

openai/gpt-4.1-nano	✓	0	1.0	✓	1.89

openai/gpt-4o	✓	0	1.0	✓	1.31

openai/gpt-4o-mini	✗	0	0.0	✗	1.59

openai/gpt-3.5-turbo	✗	0	0.0	✗	4.15

openai/o3-mini	✓	0	1.0	✓	1.62

openai/o1-mini	✓	0	1.0	✓	2.25

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.73

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.87

ollama/cogito:8b	✗	0	0.0	✗	2.94

ollama/cogito:3b	✗	0	0.0	✗	1.85

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.56

ollama/llama3.2:latest	✗	0	1.0	✗	1.42
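A sketch of the filter this prompt describes, assuming the vowels are a, e, i, o, u and using a hypothetical function name; the benchmark's recorded answer is not shown in this section.

```python
import re


def count_r_in_vowel_initial_words(sentence: str) -> int:
    """Count R's in words whose first letter is a vowel."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(word.count("r") for word in words if word[0] in "aeiou")


print(count_r_in_vowel_initial_words("Eagles and ostriches are birds,"))  # 2
```

The vowel-initial words are "Eagles", "and", "ostriches", and "are"; only the last two contain an R.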

Count Rs in long words

Problem Text:

In the sentence "The remarkable researcher presented revolutionary results," count the number of letter R's in words that have MORE THAN 10 letters.

Grading Rules:

must correctly identify that revolutionary is the only word with more than 10 letters or say the answer is 2

Correct Answer:

Model Results:

Model	Passed	Shot #	Score	Shot Passed	Time (s)
ollama/gemma3:12b	✗	0	0.0	✗	1.40

ollama/gemma2:latest	✗	0	0.0	✗	2.24

ollama/phi4:latest	✗	0	0.0	✗	1.50

ollama/phi3.5:latest	✗	0	0.0	✗	1.86

openai/gpt-4.1	✓	0	1.0	✓	1.40

openai/gpt-4.1-mini	✓	0	1.0	✓	1.65

openai/gpt-4.1-nano	✗	0	0.0	✗	1.68

openai/gpt-4o	✗	0	0.0	✗	2.50

openai/gpt-4o-mini	✓	0	1.0	✓	1.55

openai/gpt-3.5-turbo	✗	0	0.0	✗	1.50

openai/o3-mini	✓	0	1.0	✓	1.52

openai/o1-mini	✓	0	1.0	✓	2.92

anthropic/claude-3-7-sonnet-latest	✓	0	1.0	✓	1.41

anthropic/claude-3-5-sonnet-latest	✗	0	0.0	✗	1.48

ollama/cogito:8b	✗	0	0.0	✗	2.52

ollama/cogito:3b	✗	0	0.0	✗	1.67

ollama/qwen2.5-coder:3b	✗	0	0.0	✗	1.16

ollama/llama3.2:latest	✗	0	0.0	✗	1.57

R-Bench

Model Performance

Problem Details
