A simple eval focused on broad general problems to demonstrate the features in GuardSail
Generated on 2025-04-05 11:40:14
Model | Pass Rate | Token Usage | Cost |
---|---|---|---|
openai/o3-mini | | 9,403 (in: 8,221, out: 1,182) | $0.0719 (Solving: $0.0395, Grading: $0.0324) |
openai/o1-mini | | 11,729 (in: 10,497, out: 1,232) | $0.1485 (Solving: $0.1099, Grading: $0.0386) |
openai/gpt-4o-mini | | 10,599 (in: 9,446, out: 1,153) | $0.0373 (Solving: $0.0022, Grading: $0.0351) |
ollama/gemma3:12b | | 12,177 (in: 10,977, out: 1,200) | $0.0394 (Solving: $0.0000, Grading: $0.0394) |
ollama/phi4:latest | | 12,493 (in: 11,194, out: 1,299) | $0.0410 (Solving: $0.0000, Grading: $0.0410) |
anthropic/claude-3-5-sonnet-latest | | 10,406 (in: 9,168, out: 1,238) | $0.0918 (Solving: $0.0565, Grading: $0.0353) |
openai/gpt-3.5-turbo | | 10,025 (in: 8,842, out: 1,183) | $0.0387 (Solving: $0.0047, Grading: $0.0339) |
ollama/gemma2:latest | | 10,332 (in: 9,059, out: 1,273) | $0.0354 (Solving: $0.0000, Grading: $0.0354) |
ollama/qwen2.5-coder:3b | | 11,296 (in: 9,881, out: 1,415) | $0.0389 (Solving: $0.0000, Grading: $0.0389) |
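
The figures in the summary table appear to follow two simple identities: each token total is the sum of its in and out counts, and each cost is the sum of its Solving and Grading parts. The sketch below checks both for every row. The names `ROWS` and `check_row` are hypothetical and are not part of GuardSail, and the small cost tolerance is an assumption that allows for rounding to four decimal places.

```python
# Rows copied from the summary table:
# (model, total tokens, in, out, total cost, solving cost, grading cost)
ROWS = [
    ("openai/o3-mini", 9403, 8221, 1182, 0.0719, 0.0395, 0.0324),
    ("openai/o1-mini", 11729, 10497, 1232, 0.1485, 0.1099, 0.0386),
    ("openai/gpt-4o-mini", 10599, 9446, 1153, 0.0373, 0.0022, 0.0351),
    ("ollama/gemma3:12b", 12177, 10977, 1200, 0.0394, 0.0000, 0.0394),
    ("ollama/phi4:latest", 12493, 11194, 1299, 0.0410, 0.0000, 0.0410),
    ("anthropic/claude-3-5-sonnet-latest", 10406, 9168, 1238, 0.0918, 0.0565, 0.0353),
    ("openai/gpt-3.5-turbo", 10025, 8842, 1183, 0.0387, 0.0047, 0.0339),
    ("ollama/gemma2:latest", 10332, 9059, 1273, 0.0354, 0.0000, 0.0354),
    ("ollama/qwen2.5-coder:3b", 11296, 9881, 1415, 0.0389, 0.0000, 0.0389),
]


def check_row(row, cost_tol=0.00015):
    """Return True when total == in + out and cost is within cost_tol of solving + grading."""
    _, total, tok_in, tok_out, cost, solving, grading = row
    tokens_ok = total == tok_in + tok_out
    # Assumed tolerance: the displayed costs are rounded to four decimals.
    cost_ok = abs(cost - (solving + grading)) <= cost_tol
    return tokens_ok and cost_ok


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], "ok" if check_row(row) else "MISMATCH")
```

The local `ollama/*` models report a Solving cost of $0.0000, so their entire cost comes from grading.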
Index | Problem Name |
---|---|
1 |
Integral of Bose-Einstein Function
Problem Text: Evaluate the definite integral
$$
I = \int_{0}^{\infty} \frac{x^3}{e^x - 1}\, dx.
$$
Grading Rules: must provide the correct closed form in terms of π.
Correct Answer: \(\pi^4/15\)
Model Results:
|
2 |
Alternating Zeta Series
Problem Text: Evaluate the infinite series
\[
S = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}.
\]
Grading Rules: must state the correct closed form in terms of π.
Correct Answer: \(\pi^2/12\)
Model Results:
|
3 |
Minimal Polynomial Derivation
Problem Text: Find the minimal polynomial over the rational numbers for
\[
\sqrt{2} + \sqrt{3}.
\]
Grading Rules: must correctly derive the minimal polynomial in standard form.
Correct Answer: \(x^4 - 10x^2 + 1 = 0\)
Model Results:
|
4 |
Rs - Strawberry
Problem Text: how many letter R's are in the word strawberry?
Grading Rules: must state the correct answer
Correct Answer: 3
Model Results:
|
5 |
Rs - Complex
Problem Text: how many letter R's are in the sentence "ravens roost on a berry colored rangerover"?
Grading Rules: must state the correct answer
Correct Answer: 8
Model Results:
|
6 |
Stacking 3 Objects
Problem Text: if you have 3 cubes of equal dimensions: one made of paper, one of metal and one of glass, what is the safest way to stack them so none of them break?
Grading Rules: the tester must get the order correct
Correct Answer: bottom is metal, middle is glass and top is paper
Model Results:
|
7 |
Balancing a Plank
Problem Text: You have a wooden plank balanced on a round log in the center. If you place a heavy rock on one end and a small feather on the other, what will happen?
Grading Rules: must describe the correct direction of the tilt
Correct Answer: The plank will tip toward the side with the rock
Model Results:
|
8 |
Ice vs Metal - Heat
Problem Text: If you touch a piece of metal and a piece of wood that are both sitting in a room at the same temperature, why does the metal feel colder?
Grading Rules: must mention heat conduction and difference in heat transfer
Correct Answer: Because metal conducts heat away from your hand faster than wood
Model Results:
|
9 |
Floating Objects
Problem Text: If you place a rubber ball, a rock, and a piece of wood in a tub of water, which objects will float?
Grading Rules: must correctly identify which objects float and which sink
Correct Answer: The rubber ball and the piece of wood will float; the rock will sink
Model Results:
|
10 |
Mirror Image Direction
Problem Text: If you raise your right hand while standing in front of a mirror, which hand does your reflection appear to raise?
Grading Rules: must correctly state the reflection shows the opposite hand
Correct Answer: Left hand
Model Results:
|
11 |
Sunlight and Shadows
Problem Text: At noon, the sun is directly overhead. Where does your shadow point?
Grading Rules: must mention shadow is directly below or very small
Correct Answer: Directly under you / very short shadow
Model Results:
|
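
The reference answers for problems 1 through 5 can be sanity-checked numerically with the Python standard library alone. The sketch below is one way to do that. All function names are hypothetical and not part of GuardSail, and the integration cutoff and term counts are assumptions chosen so the truncation error stays well below the tolerances used. For problem 3 the check only confirms that \(\sqrt{2} + \sqrt{3}\) is a root of the stated polynomial; it does not prove irreducibility, which minimality also requires.

```python
import math


def bose_einstein_integral(upper=60.0, steps=6000):
    """Composite Simpson estimate of the integral of x^3 / (e^x - 1) over [0, upper].

    The tail beyond `upper` is negligible because the integrand decays like x^3 * e^-x.
    `steps` must be even for Simpson's rule.
    """
    def f(x):
        # The integrand tends to 0 as x -> 0, so define f(0) = 0 to avoid 0/0.
        return 0.0 if x == 0.0 else x**3 / math.expm1(x)

    h = upper / steps
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3


def alternating_zeta(terms=100_000):
    """Partial sum of sum_{n>=1} (-1)^(n+1) / n^2."""
    return sum((-1) ** (n + 1) / n**2 for n in range(1, terms + 1))


def min_poly_residual():
    """Evaluate x^4 - 10x^2 + 1 at x = sqrt(2) + sqrt(3); should be close to 0."""
    x = math.sqrt(2) + math.sqrt(3)
    return x**4 - 10 * x**2 + 1


def count_r(text):
    """Count the letter R in text, ignoring case."""
    return text.lower().count("r")


if __name__ == "__main__":
    print("Problem 1:", bose_einstein_integral(), "vs", math.pi**4 / 15)
    print("Problem 2:", alternating_zeta(), "vs", math.pi**2 / 12)
    print("Problem 3 residual:", min_poly_residual())
    print("Problem 4:", count_r("strawberry"))
    print("Problem 5:", count_r("ravens roost on a berry colored rangerover"))
```

Problems 6 through 11 rest on physical reasoning rather than computation, so they are left out of this check.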