Improve CoT support #29

```
[[evals]]
model = "gemini:gemini-1.5-pro-latest"
prompt = """
Question: Is the following statement true or false? "Python was created before Java."

Think through this step by step:

Step 1: When was Python created?
Step 2: When was Java created?
Step 3: Compare the dates

Format your response as:
1. [Step 1 reasoning]
2. [Step 2 reasoning]
3. [Step 3 reasoning]

Final Answer: True/False - [brief explanation]
"""
judge_model = "gemini:gemini-1.5-pro-latest"
expected = "True - Python (1991) was created before Java (1995)"
```
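Before handing a response to the judge model, a cheap pre-check can compare the `Final Answer:` verdict against the verdict at the start of `expected`. The following is a minimal sketch of that idea, not the project's actual grading code; `extract_verdict` and `verdict_matches` are hypothetical helper names, and it assumes the model follows the `Final Answer: True/False - ...` format requested in the prompt.

```python
import re

# Hypothetical helper pattern: finds "Final Answer: True/False", tolerating
# markdown bold markers around the label (e.g. "**Final Answer:** True").
FINAL_ANSWER_RE = re.compile(
    r"final answer\**\s*:\s*\**\s*(true|false)\b", re.IGNORECASE
)


def extract_verdict(text):
    """Return True/False from the last 'Final Answer:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].lower() == "true"


def verdict_matches(response, expected):
    """Check that the response's verdict agrees with the one leading `expected`."""
    expected_match = re.match(r"\s*(true|false)\b", expected, re.IGNORECASE)
    if expected_match is None:
        return False  # `expected` does not start with a True/False verdict
    expected_verdict = expected_match.group(1).lower() == "true"
    return extract_verdict(response) == expected_verdict
```

This only checks the True/False verdict; judging whether the step-by-step reasoning is sound is still left to `judge_model`.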

Re: Gemini

✅ Use natural numbered steps in your prompts ("1. ... 2. ... 3. ...")
✅ Your parsing code handles this - the markdown parsing will pick up Gemini's numbered output
⚠️ Prefer Gemini Pro - Flash models are weaker at complex reasoning
⚠️ Don't expect strict XML adherence - Gemini's output is more "conversational"
✅ Use Claude/GPT-4 as the judge - they're better at evaluating reasoning quality
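
To illustrate the point about markdown-style parsing, here is a rough sketch of a tolerant step parser. It is an assumption about how such parsing could work, not the repository's actual parser; `parse_steps` and `STEP_RE` are hypothetical names. It accepts plain numbered lines, bolded numbers, and "Step N:" prefixes, and skips conversational filler lines.

```python
import re

# Accepts "1. text", "1) text", "**1.** text", "- 1. text", and "Step 1: text".
# Whitespace is required after the delimiter so that "3.14 ..." is not
# mistaken for step 3.
STEP_RE = re.compile(
    r"^\s*(?:[-*]\s+)?\**\s*(?:step\s+)?(\d+)\s*[.):]\**\s+(.+?)\s*$",
    re.IGNORECASE,
)


def parse_steps(text):
    """Map step number -> reasoning text for every numbered line in `text`."""
    steps = {}
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match:
            steps[int(match.group(1))] = match.group(2)
    return steps
```

Lines that do not look like a numbered step (greetings, the `Final Answer:` line) are ignored, which is what makes this approach forgiving of Gemini's more conversational output.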
