yc1838/lilith-agent


title Lilith Agent
emoji 🦋
colorFrom pink
colorTo purple
sdk gradio
sdk_version 5.25.2
app_file app.py
pinned false
hf_oauth true
hf_oauth_expiration_minutes 480

Lilith Agent

🦋 A ReAct research assistant built on LangGraph. Lilith plans, calls tools, and answers open-ended research questions from a TUI or a batch runner over the GAIA benchmark. It currently scores 95% on GAIA level 1 on the leaderboard, under the username yc1838.

Features

  • Explicit ReAct graph: tool-call dedup, per-tool error feedback, recursion cap, iteration fail-safe
  • Three-layer persistent memory: short-term thread checkpoints, long-term semantic facts (LangMem), episodic task experiences; inspired by the Engram memory architecture
  • Tool belt: web search, URL fetch, sandboxed Python, file I/O, PDF, audio/video transcription, YouTube frame extraction, vision (Gemini + FAL fallbacks), arXiv, CrossRef, todos
  • Multi-provider routing: cheap / strong / extra-strong model tiers with independent provider+model config
  • Observability: per-session JSONL trace + rotating log file, optional Arize AX + LangSmith tracing
  • Caveman mode: compresses the system prompt so the model responds tersely (lite / full / ultra)
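
As an illustration of the tool-call dedup named in the first feature, here is a minimal sketch of exact-match deduplication. It is not Lilith's implementation: the ToolCallDeduper class, the call_key helper, and the query / k argument names are all hypothetical, and the real graph may compare calls semantically rather than by exact arguments (the GAIA_SEMANTIC_DEDUP_THRESHOLD setting suggests it might).

```python
from __future__ import annotations

import json
from typing import Any, Optional


def call_key(tool_name: str, args: dict[str, Any]) -> str:
    """Build a stable key: tool name plus canonically serialized arguments."""
    return f"{tool_name}:{json.dumps(args, sort_keys=True, default=str)}"


class ToolCallDeduper:
    """Remembers tool calls already made in a run (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def lookup(self, tool_name: str, args: dict[str, Any]) -> Optional[str]:
        """Return the stored result of an identical earlier call, if any."""
        return self._seen.get(call_key(tool_name, args))

    def record(self, tool_name: str, args: dict[str, Any], result: str) -> None:
        """Store the result so a repeated call can be answered without re-running."""
        self._seen[call_key(tool_name, args)] = result


deduper = ToolCallDeduper()
# "query" and "k" are made-up argument names used only for this demo.
deduper.record("web_search", {"query": "GAIA benchmark", "k": 3}, "cached result")

# The same call with keys in a different order maps to the same key.
hit = deduper.lookup("web_search", {"k": 3, "query": "GAIA benchmark"})
miss = deduper.lookup("web_search", {"query": "something else", "k": 3})
```

Serializing with sort_keys=True makes the key independent of argument order, so only genuinely different calls reach the tool.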

Install

pip install -e .
# or for a pinned snapshot:
pip install -r requirements.txt

You also need ffmpeg on PATH for YouTube frame extraction. If it is missing, imageio-ffmpeg (bundled via deps) is used as a fallback.

Configure

Copy .env.example (or create .env) with at least:

GAIA_ANTHROPIC_API_KEY=sk-ant-...
GAIA_GOOGLE_API_KEY=...
GAIA_TAVILY_API_KEY=tvly-...
GAIA_FAL_VISION_API_KEY=fal-...        # optional, for FAL vision
GAIA_HUGGINGFACE_API_KEY=hf_...        # optional, for GAIA dataset

# Optional tracing
ARIZE_SPACE_ID=...
ARIZE_API_KEY=...
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=lsv2_pt_...
LANGCHAIN_PROJECT="Lilith Agent"

Model routing (all optional, shown with defaults):

GAIA_CHEAP_PROVIDER=google
GAIA_CHEAP_MODEL=gemini-3-flash-preview
GAIA_STRONG_PROVIDER=anthropic
GAIA_STRONG_MODEL=claude-sonnet-4-6
GAIA_EXTRA_STRONG_PROVIDER=anthropic
GAIA_EXTRA_STRONG_MODEL=claude-sonnet-4-6
GAIA_AGENT_MODEL_TIER=extra_strong  # cheap | strong | extra_strong
# Optional one-off override for the main agent model.
# GAIA_AGENT_PROVIDER=deepseek
# GAIA_AGENT_MODEL=deepseek-v4-pro

# DeepSeek uses an OpenAI-compatible API.
# DEEPSEEK_API_KEY=...
# GAIA_DEEPSEEK_BASE_URL=https://api.deepseek.com
# GAIA_CHEAP_PROVIDER=deepseek
# GAIA_CHEAP_MODEL=deepseek-v4-flash
# GAIA_STRONG_PROVIDER=deepseek
# GAIA_STRONG_MODEL=deepseek-v4-pro

GAIA_VISION_PROVIDER=fal
GAIA_VISION_MODEL=gemini-3-flash-preview
GAIA_CAVEMAN=true
GAIA_CAVEMAN_MODE=full
GAIA_RECURSION_LIMIT=50
GAIA_BUDGET_HARD_CAP=25
GAIA_BUDGET_WARN_AT=15
GAIA_SEMANTIC_DEDUP_THRESHOLD=0.5
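
To make the tier variables above concrete, here is a hedged sketch of how a main-agent model could be resolved from them. It is not the project's Config.from_env(). The resolve_agent_model function and ModelChoice class are hypothetical, and the precedence rule (an explicit GAIA_AGENT_PROVIDER + GAIA_AGENT_MODEL pair wins over the tier) is an assumption based on the "one-off override" comment.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

# Defaults copied from the listing above.
TIER_DEFAULTS = {
    "cheap": ("google", "gemini-3-flash-preview"),
    "strong": ("anthropic", "claude-sonnet-4-6"),
    "extra_strong": ("anthropic", "claude-sonnet-4-6"),
}


@dataclass(frozen=True)
class ModelChoice:
    provider: str
    model: str


def resolve_agent_model(env: Mapping[str, str] = os.environ) -> ModelChoice:
    """Pick the main agent model (hypothetical helper, not the project's Config)."""
    # Assumption: the explicit override wins when both variables are set.
    provider = env.get("GAIA_AGENT_PROVIDER")
    model = env.get("GAIA_AGENT_MODEL")
    if provider and model:
        return ModelChoice(provider, model)

    tier = env.get("GAIA_AGENT_MODEL_TIER", "extra_strong")
    if tier not in TIER_DEFAULTS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIER_DEFAULTS)}")

    default_provider, default_model = TIER_DEFAULTS[tier]
    prefix = f"GAIA_{tier.upper()}"  # e.g. GAIA_EXTRA_STRONG
    return ModelChoice(
        env.get(f"{prefix}_PROVIDER", default_provider),
        env.get(f"{prefix}_MODEL", default_model),
    )
```

Passing a plain dict instead of os.environ, for example resolve_agent_model({"GAIA_AGENT_MODEL_TIER": "cheap"}), makes it easy to check a routing setup without touching the real environment.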

Run

Interactive TUI

lilith
# or
python -m lilith_agent.tui

The TUI prints the logo, caveman status, and the trace file path. Type your question at the lilith 🦋 > prompt.

Slash commands:

| Command | Effect |
| --- | --- |
| /clear | Wipe conversation memory, start a new thread |
| /memory list | Show all stored facts and recent episodic experiences |
| /memory forget <id> | Delete a fact by ID prefix |
| /memory reflect | Manually trigger long-term memory extraction for the current thread |
| /caveman | Toggle caveman on/off |
| /caveman off / /caveman on | Explicit on/off |
| /caveman lite | Lightest: keep articles & full sentences, cut fluff |
| /caveman full | Default: drop articles, fragments OK (classic caveman) |
| /caveman ultra | Heaviest: abbreviations, arrows for causality |
| exit / quit | Leave |

Batch run over GAIA

python scripts/dev_run_gaia.py --limit 3 --level 1
python scripts/dev_run_gaia.py --task-id c61d22de-5f6c-4958-a7f6-5e9707bd3466
# Runs all level-one test questions with caveman mode. Rerun without --force to resume.
python scripts/dev_run_gaia.py --split test --level 1 --limit -1 --cavemen --caveman-mode ultra
# Without caveman mode: set GAIA_CAVEMAN=false in .env beforehand.
python scripts/dev_run_gaia.py --split test --level 1 --limit 5

Build a leaderboard submission

python scripts/build_leaderboard_submission.py --split test --out submission.jsonl --pad-missing
# Upload submission.jsonl to https://huggingface.co/spaces/gaia-benchmark/leaderboard/submit

Per-question checkpoints land in .checkpoints/<task_id>.json. Reruns skip existing checkpoints by default. To overwrite them with fresh answers, use --force on the selected scope:

# Rerun and overwrite one task.
python scripts/dev_run_gaia.py --split test --task-id <task_id> --force

# Rerun and overwrite all level-one test tasks.
python scripts/dev_run_gaia.py --split test --level 1 --limit -1 --force
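
The skip-unless-forced behaviour can be pictured with the following sketch. It is an illustration under assumptions, not the runner's code: answer_with_checkpoint is a hypothetical helper, and the JSON layout with task_id and answer keys is invented for the demo (the real checkpoint schema is not documented here).

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Callable


def answer_with_checkpoint(
    task_id: str,
    solve: Callable[[str], str],
    checkpoint_dir: Path,
    force: bool = False,
) -> tuple[str, bool]:
    """Return (answer, reused). Hypothetical helper; the JSON schema is assumed."""
    path = checkpoint_dir / f"{task_id}.json"
    if path.exists() and not force:
        # Resume: reuse the stored answer instead of calling the agent again.
        return json.loads(path.read_text(encoding="utf-8"))["answer"], True

    answer = solve(task_id)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"task_id": task_id, "answer": answer}), encoding="utf-8")
    return answer, False


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / ".checkpoints"
    first = answer_with_checkpoint("demo-task", lambda _: "42", ckpt)
    resumed = answer_with_checkpoint("demo-task", lambda _: "changed", ckpt)
    forced = answer_with_checkpoint("demo-task", lambda _: "changed", ckpt, force=True)
```

In the demo the second call reuses the stored answer without invoking the solver, and only the forced call overwrites it.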

After any rerun, rebuild submission.jsonl; the builder reads the latest checkpoint files. The GAIA leaderboard expects the full test split: 93 level-1 rows, 159 level-2 rows, and 49 level-3 rows. Use --pad-missing so unanswered tasks are emitted as blank placeholders and the file has the required 301 rows:

python scripts/build_leaderboard_submission.py \
  --checkpoint-dir .checkpoints \
  --split test \
  --out submission.jsonl \
  --pad-missing

wc -l submission.jsonl  # should print 301
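
As a slightly stricter alternative to wc -l, the sketch below counts non-empty lines and checks that each one parses as JSON. The per-level counts come from the text above; count_jsonl_rows is a hypothetical helper and makes no assumption about the fields inside each row.

```python
from __future__ import annotations

import json
from pathlib import Path

# Row counts for the GAIA test split, as stated above.
EXPECTED_ROWS = {1: 93, 2: 159, 3: 49}
EXPECTED_TOTAL = sum(EXPECTED_ROWS.values())


def count_jsonl_rows(path: Path) -> int:
    """Count non-empty lines, verifying that each one is valid JSON."""
    count = 0
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
            count += 1
    return count


if __name__ == "__main__":
    submission = Path("submission.jsonl")
    if submission.exists():
        rows = count_jsonl_rows(submission)
        status = "ok" if rows == EXPECTED_TOTAL else "unexpected row count"
        print(f"{rows} rows, expected {EXPECTED_TOTAL}: {status}")
```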

Tools

All tools live under src/lilith_agent/tools/ and are registered in __init__.py:

| Tool | Purpose |
| --- | --- |
| web_search, fetch_url | Primary web search + page fetch |
| run_python | Sandboxed Python subprocess (bs4, pandas, trafilatura, pypdf, …) |
| read_file, ls, grep, glob_files, write_file | Local filesystem |
| transcribe_audio | faster-whisper |
| youtube_transcript | Spoken-word captions only |
| youtube_frame_at | Download + extract one frame at a timestamp (PNG) |
| inspect_pdf | PDF → text |
| inspect_visual_content | Multimodal vision (Gemini + FAL moondream/llava fallbacks) |
| arxiv_search, crossref_search, count_journal_articles, filter_entities | Academic metadata |
| todos | High-level planning |
| search_memory | Query Lilith's long-term memory (facts + episodes) by keyword |

Vision fallback chain

inspect_visual_content tries, in order: configured provider+model → same-provider stable fallback → cross-provider last resort (gemini-3-flash-preview on Google). If all fail, it trips a session-level circuit breaker so future calls return a clean error message instead of looping.
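
The ordered fallback plus circuit breaker can be sketched as below. This is a minimal illustration, not the tool's implementation: VisionFallbackChain, its describe method, and the backend callables are hypothetical stand-ins for the real provider clients, and the error message text is made up.

```python
from __future__ import annotations

from typing import Callable, Sequence

Backend = Callable[[str], str]


class VisionFallbackChain:
    """Try backends in order; after a total failure, stop trying for the session."""

    UNAVAILABLE = "vision unavailable for this session"

    def __init__(self, backends: Sequence[tuple[str, Backend]]) -> None:
        self._backends = list(backends)
        self._tripped = False  # session-level circuit breaker

    def describe(self, image_path: str) -> str:
        if self._tripped:
            # Breaker is open: answer immediately without calling any backend.
            return self.UNAVAILABLE
        errors = []
        for name, backend in self._backends:
            try:
                return backend(image_path)
            except Exception as exc:  # any failure moves on to the next backend
                errors.append(f"{name}: {exc}")
        self._tripped = True
        return f"{self.UNAVAILABLE} ({'; '.join(errors)})"


calls = []


def failing(path: str) -> str:
    calls.append(path)
    raise TimeoutError("no response")


def working(path: str) -> str:
    return f"description of {path}"


healthy = VisionFallbackChain([("configured", failing), ("fallback", working)])
result = healthy.describe("frame.png")       # falls through to the second backend

broken = VisionFallbackChain([("configured", failing)])
first_error = broken.describe("frame.png")   # every backend fails, breaker trips
calls_before = len(calls)
second_error = broken.describe("frame.png")  # short-circuits, no backend is called
```

Returning a plain message rather than raising keeps the failure visible to the model as an ordinary tool result, which matches the "clean error message" behaviour described above.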

Observability

  • Logs: .lilith/session-<timestamp>.log (WARNING+ to stderr, INFO+ to file)
  • Trace: .lilith/session-<timestamp>.jsonl, with full LLM/tool/chain events, flushed per line, replay-able
  • Arize AX: auto-enabled when ARIZE_SPACE_ID + ARIZE_API_KEY are set
  • LangSmith: set LANGCHAIN_TRACING_V2=true + LANGCHAIN_API_KEY + LANGCHAIN_PROJECT
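
To show what "flushed per line, replay-able" means in practice, here is a small sketch of a JSONL writer and reader. It is not the project's JsonlTraceCallback: JsonlTraceWriter, read_trace, and the ts / event / tool / tokens field names are hypothetical, since the real event schema is not documented here.

```python
from __future__ import annotations

import json
import tempfile
import time
from collections import Counter
from pathlib import Path
from typing import Any, Iterator


class JsonlTraceWriter:
    """Append one JSON object per line and flush after every event."""

    def __init__(self, path: Path) -> None:
        self._fh = path.open("a", encoding="utf-8")

    def emit(self, event: str, **payload: Any) -> None:
        record = {"ts": time.time(), "event": event, **payload}
        self._fh.write(json.dumps(record, default=str) + "\n")
        self._fh.flush()  # keeps already-written events intact if the process dies

    def close(self) -> None:
        self._fh.close()


def read_trace(path: Path) -> Iterator[dict[str, Any]]:
    """Replay a trace file one event at a time."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "session-demo.jsonl"
    writer = JsonlTraceWriter(trace_path)
    writer.emit("tool_start", tool="web_search")
    writer.emit("llm_end", tokens=128)
    writer.emit("tool_start", tool="fetch_url")
    writer.close()
    event_counts = Counter(rec["event"] for rec in read_trace(trace_path))
```

Because each line is a complete JSON document, a trace can be inspected with line-oriented tools or replayed incrementally while a session is still running.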

Project layout

src/lilith_agent/
  app.py             # ReAct graph, model routing, caveman prompt wrapping
  tui.py             # interactive loop, slash commands, rich output
  runner.py          # batch runner over GAIA questions
  memory.py          # three-layer memory: checkpoints, semantic facts, episodic
  config.py          # Config.from_env(), model + API key + feature flags
  observability.py   # logging, Arize setup, JsonlTraceCallback
  models.py          # provider -> chat model builder
  gaia_dataset.py    # HF GAIA dataset loader
  tools/             # LangChain @tool wrappers + impls
scripts/
  dev_run_gaia.py    # CLI to run against real GAIA questions
.checkpoints/        # per-question answers (gitignored)
.lilith/             # session logs, JSONL traces, long_term_memory.sqlite (gitignored)

Memory system

Lilith uses a three-layer persistent memory architecture loosely inspired by the Engram memory model:

| Layer | Storage | Role |
| --- | --- | --- |
| Short-term (thread checkpoints) | .lilith/threads.sqlite via LangGraph SqliteSaver | Preserves full conversation state across restarts within a thread |
| Long-term semantic (facts) | .lilith/long_term_memory.sqlite | Extracts and deduplicates user preferences, names, project details using LangMem; injected into the system prompt on new queries |
| Episodic (task experiences) | .lilith/long_term_memory.sqlite | Summarises past tool trajectories (what failed, what worked) so Lilith avoids repeating mistakes |

The semantic layer uses LangMem as a memory governance engine (extraction, conflict resolution, forgetting) while SQLite provides local, auditable, migratable persistence. Long-term memory extraction runs automatically after each conversation and can be triggered manually with /memory reflect. The agent can also call search_memory during reasoning when the system-prompt injection has been truncated.

For batch GAIA runs each question gets an isolated ephemeral memory store (MemoryStore(":memory:")) so questions cannot contaminate each other.
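
The isolation property can be demonstrated with the standard library alone, assuming MemoryStore is backed by SQLite (the ":memory:" argument and the .sqlite file names suggest this, but it is an assumption). In sqlite3, every connection to ":memory:" gets its own private database; the episodes table and the new_store helper below are hypothetical.

```python
import sqlite3


def new_store() -> sqlite3.Connection:
    """Open a fresh in-memory database with a hypothetical episodes table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE episodes (task TEXT, summary TEXT, outcome TEXT)")
    return conn


# One store per question: nothing written to store_a is visible in store_b.
store_a = new_store()
store_b = new_store()
store_a.execute(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    ("question-1", "used web_search", "success"),
)

count_a = store_a.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
count_b = store_b.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]

store_a.close()
store_b.close()
```

Closing a connection discards its in-memory database, so nothing from one question survives into the next.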

Testing

pytest

Memory tests can use the ephemeral_memory() context manager for isolated in-memory stores:

from lilith_agent.memory import ephemeral_memory

def test_something():
    with ephemeral_memory() as store:
        store.add_episode("task", "summary", "success")
        assert len(store.get_recent_episodes()) == 1
    # store discarded on exit, no disk writes
