Add llms.txt for AI-friendly article extraction #485

@MaxGhenis

Summary

Implement the llms.txt standard to make PolicyEngine research articles more token-efficient for AI extraction. This follows the pattern used by Bun, Svelte, and other projects.

Problem

When AIs extract our articles, they consume excessive tokens due to:

  • Embedded Plotly chart JSON (40-100 lines of styling per chart)
  • Need to crawl multiple pages
  • No machine-readable summary format

Proposed Solution

Files to Generate

| File | Contents | Size est. |
| --- | --- | --- |
| /llms.txt | Index with links to sections | ~2KB |
| /llms-full.txt | All articles combined, charts replaced with summaries | ~500KB |
| /llms-research-us.txt | US articles only | ~200KB |
| /llms-research-uk.txt | UK articles only | ~200KB |

Format

# PolicyEngine Research

> PolicyEngine analyzes tax and benefit policy impacts through microsimulation modeling for the US and UK.

## Recent Research

- [Article Title](slug): One-line description

## Docs

- [API Documentation](/docs/api.md)
- [Python Package](/docs/python.md)

## Full Articles

- [US Research](/llms-research-us.txt)
- [UK Research](/llms-research-uk.txt)
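
As a rough illustration of how the index above might be produced, here is a minimal sketch. The `PostEntry` shape, including the `slug` and `description` fields, and the `renderIndex` name are assumptions for illustration, not the actual posts.json schema; only `title` and the proposed `ai_summary` field appear in this issue.

```typescript
// Hypothetical subset of a posts.json entry. Only `title` and the proposed
// `ai_summary` come from this issue; `slug` and `description` are assumed.
interface PostEntry {
  title: string;
  slug: string;
  description?: string;
  ai_summary?: string;
}

// Render the llms.txt index in the format shown above.
function renderIndex(posts: PostEntry[]): string {
  const articleLines = posts.map((p) => {
    // Prefer the AI-oriented summary, then fall back to the description.
    const summary = (p.ai_summary ?? p.description ?? "").trim();
    const link = `- [${p.title}](${p.slug})`;
    return summary ? `${link}: ${summary}` : link;
  });

  return [
    "# PolicyEngine Research",
    "",
    "> PolicyEngine analyzes tax and benefit policy impacts through microsimulation modeling for the US and UK.",
    "",
    "## Recent Research",
    "",
    ...articleLines,
    "",
    "## Docs",
    "",
    "- [API Documentation](/docs/api.md)",
    "- [Python Package](/docs/python.md)",
    "",
    "## Full Articles",
    "",
    "- [US Research](/llms-research-us.txt)",
    "- [UK Research](/llms-research-uk.txt)",
    "",
  ].join("\n");
}
```

Keeping the renderer a pure function of the post list would make it easy to unit test without touching the file system.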

Article Format in llms-full.txt

---
# Article Title
Date: 2025-01-15
Authors: Max Ghenis
Tags: us, tax, reform
---

Article content here...

**Figure 1: Distributional Impact by Income Decile**
<!-- Chart showing: Bottom decile gains $X, top decile loses $Y. Progressive overall. -->

More content...

---

Key changes from raw articles:

  • Plotly JSON replaced with text descriptions of what charts show
  • Consistent YAML-style header
  • Delimiter between articles
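
A minimal sketch of emitting one article in this layout follows. The `ArticleMeta` type and `renderArticle` function are hypothetical names, and the sketch assumes authors and tags are available as string arrays.

```typescript
// Hypothetical metadata shape, mirroring the header fields shown above.
interface ArticleMeta {
  title: string;
  date: string; // ISO date, e.g. "2025-01-15"
  authors: string[];
  tags: string[];
}

// Render one article: YAML-style header, body, then a trailing delimiter.
function renderArticle(meta: ArticleMeta, body: string): string {
  return [
    "---",
    `# ${meta.title}`,
    `Date: ${meta.date}`,
    `Authors: ${meta.authors.join(", ")}`,
    `Tags: ${meta.tags.join(", ")}`,
    "---",
    "",
    body.trim(),
    "",
    "---",
    "",
  ].join("\n");
}
```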

Implementation

Build Script

Create scripts/generate-llms-txt.ts that:

  1. Reads all articles from app/src/data/posts/articles/
  2. Reads metadata from posts.json
  3. Strips Plotly JSON, replaces with chart caption/summary
  4. Concatenates into single files by region
  5. Generates index llms.txt
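
Steps 3 and 4 could be sketched roughly as below. This assumes, purely for illustration, that charts are embedded as fenced code blocks tagged `plotly`; the real articles may embed them differently, in which case the pattern would need to change. The `Article` type and both function names are hypothetical.

```typescript
// Hypothetical in-memory article: region plus raw markdown text.
interface Article {
  region: "us" | "uk";
  text: string;
}

// ASSUMPTION: charts appear as fenced blocks tagged `plotly`.
const FENCE = "`".repeat(3);
const PLOTLY_BLOCK = new RegExp(FENCE + "plotly\\n[\\s\\S]*?\\n" + FENCE, "g");

// Step 3: replace each embedded chart with a short placeholder.
function stripPlotly(
  markdown: string,
  placeholder = "[Chart: see original article]",
): string {
  return markdown.replace(PLOTLY_BLOCK, placeholder);
}

// Step 4: concatenate cleaned articles into the combined and per-region files.
function buildRegionFiles(articles: Article[]): Record<string, string> {
  const files: Record<string, string> = {
    "llms-full.txt": "",
    "llms-research-us.txt": "",
    "llms-research-uk.txt": "",
  };
  for (const article of articles) {
    const cleaned = stripPlotly(article.text).trim() + "\n\n---\n\n";
    files["llms-full.txt"] += cleaned;
    files[`llms-research-${article.region}.txt`] += cleaned;
  }
  return files;
}
```

Reading from app/src/data/posts/articles/ and writing the results to disk are left out here so the transformation stays testable in isolation.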

CI Integration

Add to build process so files are regenerated on each deploy.

Workflow for PRs Adding New Articles

When adding a new article:

  1. No extra work required - the build script auto-generates llms.txt files
  2. Optional: Add ai_summary field to your post in posts.json:
    {
      "title": "Rail Fares Freeze Analysis",
      "ai_summary": "Analyzes 2025 rail fares freeze: costs £X, benefits higher earners disproportionately, top decile receives Y% of benefit."
    }
  3. For charts: Use descriptive captions that explain the key takeaway:
    **Figure 1: Winners and losers by income decile**
    The caption becomes the chart summary in llms.txt.

Chart Summary Generation

For each Plotly chart, the script will pick a summary in this order:

  1. Use the **Figure X:** caption if present
  2. Fall back to extracting axis labels from the JSON
  3. Mark as [Chart: see original article] if no info available
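
The fallback order could look something like the sketch below. The `PlotlyFigure` type covers only the fields read here, `summarizeChart` is a hypothetical name, and the `[Chart: Y by X]` wording for the axis-label fallback is an assumption, since the issue does not specify it.

```typescript
// Plotly axis titles may be a plain string or an object with a `text` field.
type AxisTitle = string | { text?: string };

// Hypothetical minimal view of a Plotly figure: only the fields used below.
interface PlotlyFigure {
  layout?: {
    xaxis?: { title?: AxisTitle };
    yaxis?: { title?: AxisTitle };
  };
}

// Normalize an axis title to trimmed text, or undefined if empty or missing.
function titleText(title?: AxisTitle): string | undefined {
  const text = typeof title === "string" ? title : title?.text;
  const trimmed = text?.trim();
  return trimmed ? trimmed : undefined;
}

function summarizeChart(
  caption: string | undefined,
  figure: PlotlyFigure | undefined,
): string {
  // 1. Prefer the author's "Figure X:" caption when present.
  const cap = caption?.trim();
  if (cap) return cap;

  // 2. Fall back to axis labels extracted from the chart JSON.
  const x = titleText(figure?.layout?.xaxis?.title);
  const y = titleText(figure?.layout?.yaxis?.title);
  if (x && y) return `[Chart: ${y} by ${x}]`;
  if (x || y) return `[Chart: ${x ?? y}]`;

  // 3. No usable information: point readers to the original.
  return "[Chart: see original article]";
}
```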

Tasks

  • Create scripts/generate-llms-txt.ts
  • Add chart-to-summary extraction logic
  • Add optional ai_summary field to posts.json schema
  • Integrate into build process
  • Add to CI workflow
  • Document in CONTRIBUTING.md or similar
