Magicpill Labs GuardSail Demo

A simple eval of broad, general-purpose problems that demonstrates GuardSail's features.

Generated on 2025-04-05 11:40:14

Models Evaluated: 9
Total Problems: 11
Grader Model: openai/gpt-4o
Best Performer: openai/o3-mini
Test Time: 1009.43s
Grade Time: 210.54s
Total Tokens: 98,460 (Input: 87,285 | Output: 11,175)
Total Cost: $0.5428

Model Performance

Model | Pass Rate | Token Usage | Cost
openai/o3-mini | 11/11 (100.0%) | 9,403 (in: 8,221, out: 1,182) | $0.0719 (Solving: $0.0395, Grading: $0.0324)
openai/o1-mini | 11/11 (100.0%) | 11,729 (in: 10,497, out: 1,232) | $0.1485 (Solving: $0.1099, Grading: $0.0386)
openai/gpt-4o-mini | 10/11 (90.9%) | 10,599 (in: 9,446, out: 1,153) | $0.0373 (Solving: $0.0022, Grading: $0.0351)
ollama/gemma3:12b | 9/11 (81.8%) | 12,177 (in: 10,977, out: 1,200) | $0.0394 (Solving: $0.0000, Grading: $0.0394)
ollama/phi4:latest | 7/11 (63.6%) | 12,493 (in: 11,194, out: 1,299) | $0.0410 (Solving: $0.0000, Grading: $0.0410)
anthropic/claude-3-5-sonnet-latest | 7/11 (63.6%) | 10,406 (in: 9,168, out: 1,238) | $0.0918 (Solving: $0.0565, Grading: $0.0353)
openai/gpt-3.5-turbo | 6/11 (54.5%) | 10,025 (in: 8,842, out: 1,183) | $0.0387 (Solving: $0.0047, Grading: $0.0339)
ollama/gemma2:latest | 5/11 (45.5%) | 10,332 (in: 9,059, out: 1,273) | $0.0354 (Solving: $0.0000, Grading: $0.0354)
ollama/qwen2.5-coder:3b | 4/11 (36.4%) | 11,296 (in: 9,881, out: 1,415) | $0.0389 (Solving: $0.0000, Grading: $0.0389)
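
As a sanity check on the summary figures, the per-model rows can be summed and compared with the reported totals. The sketch below is not part of GuardSail; the `ROWS` list and the `summarize` helper are hypothetical names, and the numbers are copied by hand from the Model Performance table.

```python
# Per-model figures copied from the Model Performance table:
# (model, input tokens, output tokens, total cost in USD)
ROWS = [
    ("openai/o3-mini", 8221, 1182, 0.0719),
    ("openai/o1-mini", 10497, 1232, 0.1485),
    ("openai/gpt-4o-mini", 9446, 1153, 0.0373),
    ("ollama/gemma3:12b", 10977, 1200, 0.0394),
    ("ollama/phi4:latest", 11194, 1299, 0.0410),
    ("anthropic/claude-3-5-sonnet-latest", 9168, 1238, 0.0918),
    ("openai/gpt-3.5-turbo", 8842, 1183, 0.0387),
    ("ollama/gemma2:latest", 9059, 1273, 0.0354),
    ("ollama/qwen2.5-coder:3b", 9881, 1415, 0.0389),
]


def summarize(rows):
    """Return (input tokens, output tokens, cost) summed over all rows."""
    total_in = sum(r[1] for r in rows)
    total_out = sum(r[2] for r in rows)
    total_cost = sum(r[3] for r in rows)
    return total_in, total_out, total_cost


total_in, total_out, total_cost = summarize(ROWS)
print(total_in, total_out, total_in + total_out)
print(f"${total_cost:.4f}")
```

The token sums reproduce the reported 87,285 input, 11,175 output, and 98,460 total tokens. The summed cost differs from the reported $0.5428 by about $0.0001, which is consistent with each row having been rounded to four decimals before display.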

Problem Details

Index Problem Name
1. Integral of Bose-Einstein Function

Problem Text:

Evaluate the definite integral $$ I = \int_{0}^{\infty} \frac{x^3}{e^x - 1}\, dx. $$

Grading Rules:

must provide the correct closed form in terms of π.

Correct Answer:

\(\pi^4/15\)
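
For reference, one standard route to this closed form expands the integrand as a geometric series and integrates term by term. The sketch below is not taken from the eval itself; it relies only on the textbook values \(\Gamma(4) = 6\) and \(\zeta(4) = \pi^4/90\).

```latex
\begin{aligned}
I &= \int_{0}^{\infty} \frac{x^3}{e^x - 1}\, dx
   = \int_{0}^{\infty} x^3 \sum_{k=1}^{\infty} e^{-kx}\, dx
   = \sum_{k=1}^{\infty} \frac{\Gamma(4)}{k^4} \\
  &= \Gamma(4)\,\zeta(4)
   = 6 \cdot \frac{\pi^4}{90}
   = \frac{\pi^4}{15}.
\end{aligned}
```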

Model Results:

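Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.67
ollama/gemma3:12b 1 1.0 1.41
ollama/gemma3:12b 2 1.0 1.36
ollama/gemma2:latest 0 0.0 1.84
ollama/gemma2:latest 1 0.0 2.10
ollama/gemma2:latest 2 0.0 1.97
ollama/phi4:latest 0 1.0 1.52
ollama/phi4:latest 1 1.0 1.21
ollama/phi4:latest 2 1.0 1.45
ollama/qwen2.5-coder:3b 0 0.0 2.48
ollama/qwen2.5-coder:3b 1 0.0 2.17
ollama/qwen2.5-coder:3b 2 0.0 2.43
openai/o3-mini 0 1.0 1.30
openai/o3-mini 1 1.0 1.95
openai/o3-mini 2 1.0 1.33
openai/o1-mini 0 1.0 1.00
openai/o1-mini 1 1.0 1.39
openai/o1-mini 2 1.0 0.89
anthropic/claude-3-5-sonnet-latest 0 1.0 1.28
anthropic/claude-3-5-sonnet-latest 1 1.0 1.32
anthropic/claude-3-5-sonnet-latest 2 1.0 1.29
openai/gpt-4o-mini 0 1.0 1.40
openai/gpt-4o-mini 1 1.0 2.41
openai/gpt-4o-mini 2 1.0 1.18
openai/gpt-3.5-turbo 0 0.0 2.41
openai/gpt-3.5-turbo 1 0.0 1.57
openai/gpt-3.5-turbo 2 0.0 1.82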

2. Alternating Zeta Series

Problem Text:

Evaluate the infinite series \[ S = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}. \]

Grading Rules:

must state the correct closed form in terms of π.

Correct Answer:

\(\pi^2/12\)
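
A short derivation, included here as a reference sketch rather than as part of the eval: subtracting twice the even-indexed terms from \(\zeta(2) = \pi^2/6\) flips their signs and yields the alternating series.

```latex
\begin{aligned}
S &= \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}
   = \sum_{n=1}^{\infty} \frac{1}{n^2} - 2 \sum_{m=1}^{\infty} \frac{1}{(2m)^2} \\
  &= \zeta(2) - \frac{1}{2}\,\zeta(2)
   = \frac{1}{2} \cdot \frac{\pi^2}{6}
   = \frac{\pi^2}{12}.
\end{aligned}
```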

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.05
ollama/gemma2:latest 0 1.0 1.36
ollama/phi4:latest 0 1.0 1.32
ollama/qwen2.5-coder:3b 0 1.0 1.95
openai/o3-mini 0 1.0 1.88
openai/o1-mini 0 1.0 1.45
anthropic/claude-3-5-sonnet-latest 0 1.0 3.64
openai/gpt-4o-mini 0 1.0 1.32
openai/gpt-3.5-turbo 0 1.0 1.18

3. Minimal Polynomial Derivation

Problem Text:

Find the minimal polynomial over the rational numbers for \[ \sqrt{2} + \sqrt{3}. \]

Grading Rules:

must correctly derive the minimal polynomial in standard form.

Correct Answer:

x^4 - 10x^2 + 1 = 0
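
The expected derivation can be sketched by squaring twice to eliminate the radicals. This is a reference outline, not text from the eval.

```latex
\begin{aligned}
x &= \sqrt{2} + \sqrt{3} \\
x^2 &= 5 + 2\sqrt{6} \\
(x^2 - 5)^2 &= 24 \\
x^4 - 10x^2 + 1 &= 0
\end{aligned}
```

The quartic is minimal because \(\mathbb{Q}(\sqrt{2}, \sqrt{3})\) has degree 4 over \(\mathbb{Q}\) and the four conjugates \(\pm\sqrt{2} \pm \sqrt{3}\) are distinct, so \(\sqrt{2} + \sqrt{3}\) cannot satisfy a rational polynomial of lower degree.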

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.79
ollama/gemma2:latest 0 0.0 1.72
ollama/phi4:latest 0 1.0 1.57
ollama/qwen2.5-coder:3b 0 1.0 1.35
openai/o3-mini 0 1.0 2.63
openai/o1-mini 0 1.0 2.06
anthropic/claude-3-5-sonnet-latest 0 1.0 1.79
openai/gpt-4o-mini 0 1.0 3.86
openai/gpt-3.5-turbo 0 1.0 1.42

4. Rs - Strawberry

Problem Text:

How many letter R's are in the word strawberry?

Grading Rules:

must state the correct answer

Correct Answer:

3

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 0.81
ollama/gemma2:latest 0 0.0 0.91
ollama/phi4:latest 0 0.0 1.71
ollama/qwen2.5-coder:3b 0 0.0 1.14
openai/o3-mini 0 1.0 0.98
openai/o1-mini 0 1.0 1.28
anthropic/claude-3-5-sonnet-latest 0 0.0 1.30
openai/gpt-4o-mini 0 1.0 1.70
openai/gpt-3.5-turbo 0 0.0 0.93

5. Rs - Complex

Problem Text:

How many letter R's are in the sentence "ravens roost on a berry colored rangerover"?

Grading Rules:

must state the correct answer

Correct Answer:

8
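
The expected counts for this problem and for "Rs - Strawberry" are easy to confirm programmatically. The sketch below uses a hypothetical helper, `count_letter`, built on Python's `str.count`; it is not part of GuardSail's grader, which uses a grader model.

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())


sentence = "ravens roost on a berry colored rangerover"

# Per-word breakdown, useful for seeing where the R's come from.
per_word = {word: count_letter(word, "r") for word in sentence.split()}
print(per_word)

print(count_letter(sentence, "r"))      # 8
print(count_letter("strawberry", "r"))  # 3
```

The per-word breakdown is 1 (ravens), 1 (roost), 2 (berry), 1 (colored), and 3 (rangerover), with no R in "on" or "a".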

Model Results:

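Model Shot # Score Time (s)
ollama/gemma3:12b 0 0.0 1.30
ollama/gemma3:12b 1 0.0 2.02
ollama/gemma3:12b 2 0.0 0.97
ollama/gemma2:latest 0 0.0 2.12
ollama/gemma2:latest 1 0.0 1.17
ollama/gemma2:latest 2 0.0 1.47
ollama/phi4:latest 0 0.0 1.15
ollama/phi4:latest 1 0.0 1.18
ollama/phi4:latest 2 0.0 1.83
ollama/qwen2.5-coder:3b 0 0.0 1.99
ollama/qwen2.5-coder:3b 1 0.0 2.41
ollama/qwen2.5-coder:3b 2 0.0 0.92
openai/o3-mini 0 1.0 1.18
openai/o3-mini 1 1.0 1.41
openai/o3-mini 2 1.0 1.75
openai/o1-mini 0 0.0 1.45
openai/o1-mini 1 1.0 1.19
openai/o1-mini 2 0.0 2.58
anthropic/claude-3-5-sonnet-latest 0 0.0 1.33
anthropic/claude-3-5-sonnet-latest 1 0.0 1.41
anthropic/claude-3-5-sonnet-latest 2 0.0 1.27
openai/gpt-4o-mini 0 0.0 0.87
openai/gpt-4o-mini 1 0.0 0.97
openai/gpt-4o-mini 2 0.0 1.20
openai/gpt-3.5-turbo 0 0.0 1.56
openai/gpt-3.5-turbo 1 0.0 0.89
openai/gpt-3.5-turbo 2 0.0 1.00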
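
The report does not state how multi-shot results roll up into a per-problem pass. The summary credits openai/o1-mini with 11/11 even though only one of its three shots here scored 1.0, which suggests a problem counts as passed when any shot passes. The sketch below encodes that inferred rule; `problem_passed` and its `threshold` parameter are hypothetical names, not GuardSail's API.

```python
def problem_passed(shot_scores, threshold=1.0):
    """Assumed rule: a problem passes if any single shot reaches the threshold."""
    return any(score >= threshold for score in shot_scores)


# Shot scores for "Rs - Complex", copied from the results table above
# (three of the nine models shown).
rs_complex = {
    "openai/o3-mini": [1.0, 1.0, 1.0],
    "openai/o1-mini": [0.0, 1.0, 0.0],
    "openai/gpt-4o-mini": [0.0, 0.0, 0.0],
}

passed = {model: problem_passed(scores) for model, scores in rs_complex.items()}
print(passed)
```

Under this assumption, the per-problem tables reproduce every pass rate in the Model Performance summary; a stricter rule such as "all shots must pass" would leave openai/o1-mini at 10/11.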

6. Stacking 3 Objects

Problem Text:

If you have 3 cubes of equal dimensions, one made of paper, one of metal, and one of glass, what is the safest way to stack them so none of them break?

Grading Rules:

the model under test must get the stacking order correct

Correct Answer:

bottom is metal, middle is glass and top is paper

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.05
ollama/gemma2:latest 0 0.0 1.95
ollama/phi4:latest 0 0.0 2.27
ollama/qwen2.5-coder:3b 0 0.0 1.75
openai/o3-mini 0 1.0 1.31
openai/o1-mini 0 1.0 1.23
anthropic/claude-3-5-sonnet-latest 0 1.0 1.47
openai/gpt-4o-mini 0 1.0 1.11
openai/gpt-3.5-turbo 0 0.0 1.43

7. Balancing a Plank

Problem Text:

You have a wooden plank balanced on a round log in the center. If you place a heavy rock on one end and a small feather on the other, what will happen?

Grading Rules:

must describe the correct direction of the tilt

Correct Answer:

The plank will tip toward the side with the rock
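
The reasoning can be written as a torque balance about the log. The symbols below are assumptions introduced for illustration: both objects sit at the same distance \(d\) from the pivot, with masses \(m_{\text{rock}}\) and \(m_{\text{feather}}\), and \(g\) is the gravitational acceleration.

```latex
\tau_{\text{net}}
  = m_{\text{rock}}\, g\, d - m_{\text{feather}}\, g\, d
  = (m_{\text{rock}} - m_{\text{feather}})\, g\, d > 0
```

Since \(m_{\text{rock}} > m_{\text{feather}}\), the net torque rotates the plank toward the rock's end.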

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.22
ollama/gemma2:latest 0 1.0 1.50
ollama/phi4:latest 0 1.0 1.53
ollama/qwen2.5-coder:3b 0 1.0 1.13
openai/o3-mini 0 1.0 1.10
openai/o1-mini 0 1.0 1.43
anthropic/claude-3-5-sonnet-latest 0 1.0 1.38
openai/gpt-4o-mini 0 1.0 0.85
openai/gpt-3.5-turbo 0 1.0 1.28

8. Metal vs Wood - Heat

Problem Text:

If you touch a piece of metal and a piece of wood that are both sitting in a room at the same temperature, why does the metal feel colder?

Grading Rules:

must mention heat conduction and difference in heat transfer

Correct Answer:

Because metal conducts heat away from your hand faster than wood

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.60
ollama/gemma2:latest 0 1.0 1.46
ollama/phi4:latest 0 1.0 1.59
ollama/qwen2.5-coder:3b 0 0.0 2.15
openai/o3-mini 0 1.0 1.17
openai/o1-mini 0 1.0 2.06
anthropic/claude-3-5-sonnet-latest 0 1.0 3.50
openai/gpt-4o-mini 0 1.0 1.19
openai/gpt-3.5-turbo 0 1.0 1.52

9. Floating Objects

Problem Text:

If you place a rubber ball, a rock, and a piece of wood in a tub of water, which objects will float?

Grading Rules:

must correctly identify which objects float and which sink

Correct Answer:

The rubber ball and the piece of wood will float; the rock will sink

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.38
ollama/gemma2:latest 0 0.0 1.42
ollama/phi4:latest 0 0.0 1.72
ollama/qwen2.5-coder:3b 0 0.0 1.95
openai/o3-mini 0 1.0 1.66
openai/o1-mini 0 1.0 1.40
anthropic/claude-3-5-sonnet-latest 0 0.0 1.82
openai/gpt-4o-mini 0 1.0 1.37
openai/gpt-3.5-turbo 0 0.0 1.61

10. Mirror Image Direction

Problem Text:

If you raise your right hand while standing in front of a mirror, which hand does your reflection appear to raise?

Grading Rules:

must correctly state the reflection shows the opposite hand

Correct Answer:

Left hand

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 1.0 1.54
ollama/gemma2:latest 0 1.0 1.80
ollama/phi4:latest 0 1.0 2.08
ollama/qwen2.5-coder:3b 0 0.0 1.67
openai/o3-mini 0 1.0 1.40
openai/o1-mini 0 1.0 1.61
anthropic/claude-3-5-sonnet-latest 0 1.0 2.12
openai/gpt-4o-mini 0 1.0 1.27
openai/gpt-3.5-turbo 0 1.0 2.78

11. Sunlight and Shadows

Problem Text:

At noon, the sun is directly overhead. Where does your shadow point?

Grading Rules:

must mention shadow is directly below or very small

Correct Answer:

Directly under you / very short shadow
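
The expected answer follows from the usual shadow-length relation. The symbols are assumptions for illustration: \(h\) is the person's height, \(\alpha\) is the sun's elevation angle above the horizon, and \(L\) is the shadow length on level ground.

```latex
L = \frac{h}{\tan \alpha},
\qquad
\lim_{\alpha \to 90^\circ} L = 0
```

With the sun directly overhead, \(\alpha = 90^\circ\), so the shadow shrinks to the area directly beneath the person and has no meaningful direction.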

Model Results:

Model Shot # Score Time (s)
ollama/gemma3:12b 0 0.0 1.68
ollama/gemma2:latest 0 1.0 0.99
ollama/phi4:latest 0 1.0 1.11
ollama/qwen2.5-coder:3b 0 1.0 1.41
openai/o3-mini 0 1.0 1.22
openai/o1-mini 0 1.0 1.34
anthropic/claude-3-5-sonnet-latest 0 0.0 1.16
openai/gpt-4o-mini 0 1.0 1.04
openai/gpt-3.5-turbo 0 1.0 1.75

Generated with 🛡️ GuardSail. For more details, visit https://magicpilllabs.com/