This repository hosts the source code for my master's thesis 'Collaborative Multi-Agent Architecture for Domain-Agnostic Named Entity Recognition'. Instructions for running the benchmarks and reproducing the results from the thesis can be found below.

Requirements:
- Python 3.12+

- Install Poetry:

      curl -sSL https://install.python-poetry.org | python3 -

- Create a virtual environment and install the dependencies:

      poetry install

- Activate the virtual environment:

      poetry shell

The repo includes a command-line interface to run various benchmarks and evaluate different variants of AgenticNER.

- Get an API key from Anthropic and export it:

      export ANTHROPIC_API_KEY="<your api key>"

- Get an API key from tavily.com to enable internet access for the research agent and export it:

      export TAVILY_API_KEY="<your api key>"

- Set the Anthropic API key in config.yaml. The LiteLLM proxy uses this file to expose an OpenAI-compatible interface for Anthropic models, which AutoGen requires.
- Start the LiteLLM proxy in a separate command-line session:

      poetry run litellm --config config.yaml

- Export the LiteLLM server address as the OpenAI base URL:

      export OPENAI_BASE_URL=http://0.0.0.0:4000

Usage:

    poetry run python src/ner/eval/run.py --benchmark <benchmark> --variant <variant> [--llm <llm>] [--sample-size <size>]

- --benchmark: Choose the benchmark dataset
  - Options: genia, music, buster, astro
- --variant: Choose the evaluation variant
  - few-shot: Basic few-shot learning approach
  - agentic-ner-no-grounding: AgenticNER without grounding
  - agentic-ner-grounding: AgenticNER with grounding enabled
  - agentic-ner-grounding-no-internet: AgenticNER with grounding but no internet access
  - agentic-ner-grounding-no-researcher: AgenticNER with grounding but no researcher agent
- --llm: Choose the LLM model (default: haiku)
  - Options: haiku, sonnet
- --sample-size: Number of samples to evaluate (default: 500)

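The flags and defaults listed above can be pictured as an argparse parser. This is only an illustrative sketch built from the option lists in this README; the function name `build_parser` is hypothetical, and the actual `src/ner/eval/run.py` may parse its arguments differently.

```python
import argparse

# Choices copied from the option lists in this README.
# Assumption: the real run.py may define or validate them differently.
BENCHMARKS = ["genia", "music", "buster", "astro"]
VARIANTS = [
    "few-shot",
    "agentic-ner-no-grounding",
    "agentic-ner-grounding",
    "agentic-ner-grounding-no-internet",
    "agentic-ner-grounding-no-researcher",
]
LLMS = ["haiku", "sonnet"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Run an AgenticNER benchmark.")
    parser.add_argument("--benchmark", required=True, choices=BENCHMARKS)
    parser.add_argument("--variant", required=True, choices=VARIANTS)
    parser.add_argument("--llm", default="haiku", choices=LLMS)
    parser.add_argument("--sample-size", type=int, default=500)
    return parser


# Parse an explicit argument list instead of sys.argv so the defaults are visible.
args = build_parser().parse_args(["--benchmark", "genia", "--variant", "few-shot"])
print(args.benchmark, args.variant, args.llm, args.sample_size)
```

Omitting `--llm` and `--sample-size` falls back to the documented defaults, and an unknown benchmark or variant name is rejected by `choices` before any evaluation starts.
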
# Run few-shot evaluation on GENIA using Haiku
poetry run python src/ner/eval/run.py --benchmark genia --variant few-shot
# Run full AgenticNER on MusicRecoNER using Sonnet
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-grounding --llm sonnet
# Run AgenticNER without internet on Buster with 100 samples
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-grounding-no-internet --sample-size 100

To reproduce the results from the thesis, run the following commands for each benchmark:
# Baseline (Few-shot single LLM call with Haiku)
poetry run python src/ner/eval/run.py --benchmark genia --variant few-shot
# Baseline (Few-shot single LLM call with Sonnet)
poetry run python src/ner/eval/run.py --benchmark genia --variant few-shot --llm sonnet
# AgenticNER variants
poetry run python src/ner/eval/run.py --benchmark genia --variant agentic-ner-no-grounding
poetry run python src/ner/eval/run.py --benchmark genia --variant agentic-ner-grounding
poetry run python src/ner/eval/run.py --benchmark genia --variant agentic-ner-grounding-no-internet
poetry run python src/ner/eval/run.py --benchmark genia --variant agentic-ner-grounding-no-researcher
poetry run python src/ner/eval/run.py --benchmark genia --variant agentic-ner-grounding --llm sonnet

poetry run python src/ner/eval/run.py --benchmark music --variant few-shot
poetry run python src/ner/eval/run.py --benchmark music --variant few-shot --llm sonnet
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-no-grounding
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-grounding
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-grounding-no-internet
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-grounding-no-researcher
poetry run python src/ner/eval/run.py --benchmark music --variant agentic-ner-grounding --llm sonnet

poetry run python src/ner/eval/run.py --benchmark buster --variant few-shot
poetry run python src/ner/eval/run.py --benchmark buster --variant few-shot --llm sonnet
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-no-grounding
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-grounding
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-grounding-no-internet
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-grounding-no-researcher
poetry run python src/ner/eval/run.py --benchmark buster --variant agentic-ner-grounding --llm sonnet

poetry run python src/ner/eval/run.py --benchmark astro --variant few-shot
poetry run python src/ner/eval/run.py --benchmark astro --variant few-shot --llm sonnet
poetry run python src/ner/eval/run.py --benchmark astro --variant agentic-ner-no-grounding
poetry run python src/ner/eval/run.py --benchmark astro --variant agentic-ner-grounding
poetry run python src/ner/eval/run.py --benchmark astro --variant agentic-ner-grounding-no-internet
poetry run python src/ner/eval/run.py --benchmark astro --variant agentic-ner-grounding-no-researcher
poetry run python src/ner/eval/run.py --benchmark astro --variant agentic-ner-grounding --llm sonnet
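
The commands above follow one pattern: the same seven runs repeated for each of the four benchmarks. As a convenience, they can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository; the names `RUNS` and `build_commands` are hypothetical, and the script only prints the commands.

```python
import shlex

BENCHMARKS = ["genia", "music", "buster", "astro"]

# (variant, llm) pairs in the order listed above.
# llm=None means the --llm flag is omitted, so the CLI default (haiku) applies.
RUNS = [
    ("few-shot", None),
    ("few-shot", "sonnet"),
    ("agentic-ner-no-grounding", None),
    ("agentic-ner-grounding", None),
    ("agentic-ner-grounding-no-internet", None),
    ("agentic-ner-grounding-no-researcher", None),
    ("agentic-ner-grounding", "sonnet"),
]


def build_commands() -> list[list[str]]:
    """Return one argument list per (benchmark, run) combination."""
    commands = []
    for benchmark in BENCHMARKS:
        for variant, llm in RUNS:
            cmd = [
                "poetry", "run", "python", "src/ner/eval/run.py",
                "--benchmark", benchmark,
                "--variant", variant,
            ]
            if llm is not None:
                cmd += ["--llm", llm]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands for inspection; nothing is executed here.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

To execute the runs instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)`, assuming the LiteLLM proxy is running and the API keys are exported as described in the setup steps.
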