[[evals]]
model = "gemini:gemini-1.5-pro-latest"
prompt = """
Question: Is the following statement true or false? "Python was created before Java."
Think through this step by step:
Step 1: When was Python created?
Step 2: When was Java created?
Step 3: Compare the dates
Format your response as:
1. [Step 1 reasoning]
2. [Step 2 reasoning]
3. [Step 3 reasoning]
Final Answer: True/False - [brief explanation]
"""
judge_model = "gemini:gemini-1.5-pro-latest"
expected = "True - Python (1991) was created before Java (1995)"
re: gemini
✅ Use natural numbered steps in your prompts ("1. ... 2. ... 3. ..."), as in the config above
✅ Your parsing code already handles this - markdown parsing will catch Gemini's numbered output
⚠️ Prefer Gemini Pro - Flash models are weaker at complex reasoning
⚠️ Don't expect strict XML adherence - Gemini's output is more "conversational"
✅ Use Claude/GPT-4 as the judge - they're better at evaluating reasoning quality (the config above still has Gemini judging itself via judge_model)
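
For reference, here is a rough sketch of what "markdown parsing" of the numbered-step format could look like. This is not your actual parsing code, which isn't shown here: `parse_response`, `STEP_RE` and `FINAL_RE` are hypothetical names, and the sample text is made up to match the format the prompt asks for. It assumes one step per line and a single "Final Answer:" line, possibly wrapped in markdown bold.

```python
import re

# Hypothetical patterns for the format requested in the prompt:
# "1. ..." / "2) ..." step lines and a "Final Answer: True/False - ..." line.
STEP_RE = re.compile(r"^(\d+)[.)]\s+(.+)$")
FINAL_RE = re.compile(r"^Final Answer:\s*(True|False)\b\s*[-:]?\s*(.*)$", re.IGNORECASE)


def parse_response(text):
    """Extract numbered reasoning steps and the final verdict from model output."""
    steps = []
    verdict = None
    explanation = ""
    for raw in text.splitlines():
        # Drop markdown emphasis markers so "**Final Answer:**" still matches.
        # Note: this also removes literal asterisks inside a step's text.
        line = raw.replace("*", "").strip()
        if not line:
            continue
        final = FINAL_RE.match(line)
        if final:
            verdict = final.group(1).lower() == "true"
            explanation = final.group(2).strip()
            continue
        step = STEP_RE.match(line)
        if step:
            steps.append(step.group(2).strip())
    return {"steps": steps, "verdict": verdict, "explanation": explanation}


# Made-up sample output in the requested format.
sample = """1. Python was first released in 1991.
2. Java was first released in 1995.
3. 1991 is earlier than 1995.
**Final Answer:** True - Python (1991) predates Java (1995)"""

result = parse_response(sample)
print(result["verdict"], len(result["steps"]), result["explanation"])
```

If `verdict` comes back as `None`, the model skipped the "Final Answer:" line, which is worth counting as a format failure separately from a wrong answer.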