feat(connector): Notion database target #2049

Open
badmonster0 wants to merge 17 commits into main from feat/notion-target-connector

badmonster0 (Member) commented Jun 1, 2026

Summary

Adds cocoindex.connectors.notion — a declarative target for Notion databases (called data sources in the 2025-09-03 API), with the same upsert + automatic-delete reconcile semantics as connectors.postgres.

Two modes:

  • managed_by="user" (default) — point at an existing data source; the connector validates the live property schema matches at mount, then syncs rows.
  • managed_by="system" — give the connector a parent page or database plus a title; it finds or creates the data source on first run, PATCH-adds new properties as the dataclass grows, and rejects destructive changes unless allow_destructive=True.

The archive-on-undeclare path is the main thing the hand-rolled HTTP code in cocoindex-gtm couldn't do — drop a row from the declared set, re-run, the matching Notion page gets archived.
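
The upsert + archive-on-undeclare behaviour described above can be pictured as a plain diff between the rows tracked from the previous run and the rows declared in the current one. This is only an illustrative sketch, not the connector's code: `plan_reconcile`, `ReconcilePlan`, and the string fingerprints are hypothetical stand-ins for whatever tracking CocoIndex actually keeps.

```python
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    """Actions needed to bring the target in line with the declared rows."""
    create: list[str] = field(default_factory=list)
    update: list[str] = field(default_factory=list)
    archive: list[str] = field(default_factory=list)


def plan_reconcile(tracked: dict[str, str], declared: dict[str, str]) -> ReconcilePlan:
    """Diff two {primary_key: fingerprint} maps (hypothetical helper).

    tracked  -- rows written by the previous run
    declared -- rows declared by the current run
    """
    plan = ReconcilePlan()
    for pk, fingerprint in declared.items():
        if pk not in tracked:
            plan.create.append(pk)
        elif tracked[pk] != fingerprint:
            plan.update.append(pk)
    # Anything tracked earlier but no longer declared gets archived.
    plan.archive = [pk for pk in tracked if pk not in declared]
    return plan


previous = {"alice": "v1", "bob": "v1", "carol": "v1"}
current = {"alice": "v1", "bob": "v2", "dave": "v1"}
plan = plan_reconcile(previous, current)
print(plan)
```

In this toy run, `alice` is untouched, `bob` is updated, `dave` is created, and `carol` is archived because it dropped out of the declared set.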

Design doc (approved): https://www.notion.so/372daa511a0880fea8c2d9852b1d9f82

What's in this PR

  • Connector (python/cocoindex/connectors/notion/):

    • NotionClient — token-scoped, async, rate-limited (3 req/s semaphore), Retry-After-aware exponential backoff (inline manual retry — no tenacity dep).
    • DatabaseSchema — binds a Python row class to Notion properties via Annotated[T, notion.SomeProp(...)] (or property_map={...}). Validates against the live data source at mount; diff_against() splits additive from destructive changes.
    • mount_database_target(client, data_source_id=None, schema, *, managed_by, parent_page_id, parent_database_id, title, on_delete, allow_destructive) — plus the lower-level database_target and declare_database_target.
    • 9 property types: TitleProp, RichTextProp, NumberProp, UrlProp, EmailProp, SelectProp, MultiSelectProp, DateProp, CheckboxProp. Each has encode / decode / to_notion_schema() for create/PATCH bodies.
    • Query-on-miss page_id resolution (one Notion query per unknown PK, cached for the rest of the run).
    • OnDelete.ARCHIVE (default) / HARD / IGNORE.
  • Test suite (python/tests/connectors/test_notion_target.py, 12 cases — all passing locally against a real Notion workspace):

    • 3 schema-validation tests that don't touch Notion (run anywhere):
      • test_property_map_typo_raises
      • test_schema_requires_at_most_one_title
      • test_managed_by_args_validation
    • 9 integration tests gated by NOTION_TEST_TOKEN + NOTION_TEST_PARENT_PAGE:
      • test_insert_update_archive — full lifecycle: insert 3, update 1, drop 1 → drops archived
      • test_on_delete_ignore_leaves_page: OnDelete.IGNORE doesn't touch the page on undeclare
      • test_on_delete_hard: OnDelete.HARD actually removes from active queries
      • test_noop_when_no_changes — re-run with identical data → zero PATCHes (verified via last_edited_time)
      • test_property_types_roundtrip — title + rich_text + number + url + checkbox + select + date all round-trip cleanly
      • test_first_run_against_existing_page — pre-seeded page with declared PK gets PATCHed, not duplicated (query-on-miss happy path)
      • test_schema_validation_type_mismatch — wrong type at mount → ValueError, zero writes
      • test_schema_validation_missing_property — missing column at mount → ValueError, zero writes
      • test_system_creates_and_evolves — system mode creates the data source on first run, PATCH-adds a new property on the second run

    Suite runs in ~42s end-to-end. Each integration test creates its own temp data source and archives it in teardown.

  • Docs: docs/src/content/docs/connectors/notion.mdx covers connection setup, both modes, all property types, delete strategies, page-id persistence, and the four Notion-API setup gotchas (integration sharing, parent access, internal vs public integrations, API version pinning). Sidebar entry added.

  • Example: examples/notion_target_basics/ — runnable demo with Person rows + README showing the insert / no-op / archive lifecycle.

  • Packaging: pyproject.toml gains a notion optional extra (aiohttp only).
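
For readers unfamiliar with the `Annotated[T, ...]` binding style used by DatabaseSchema, here is a rough, self-contained sketch of how such metadata can be collected from a row class. `PropMarker` and `collect_properties` are hypothetical names invented for this illustration; the real connector uses its own `notion.TitleProp(...)`-style markers and may work differently.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class PropMarker:
    """Hypothetical stand-in for markers such as notion.TitleProp(...)."""
    notion_type: str
    name: Optional[str] = None  # Notion property name; falls back to the field name


def collect_properties(row_cls: type) -> dict[str, tuple[str, str]]:
    """Map field name -> (Notion property name, Notion property type)."""
    props: dict[str, tuple[str, str]] = {}
    hints = get_type_hints(row_cls, include_extras=True)
    for field_name, hint in hints.items():
        if get_origin(hint) is not Annotated:
            continue  # plain fields carry no Notion binding
        for meta in get_args(hint)[1:]:
            if isinstance(meta, PropMarker):
                props[field_name] = (meta.name or field_name, meta.notion_type)
                break
    titles = [f for f, (_, kind) in props.items() if kind == "title"]
    if len(titles) > 1:
        raise ValueError(f"at most one title property is allowed, got {titles}")
    return props


@dataclass
class Person:
    name: Annotated[str, PropMarker("title", name="Name")]
    age: Annotated[int, PropMarker("number", name="Age")]
    notes: str = ""  # unannotated, so ignored by the sketch


bindings = collect_properties(Person)
print(bindings)
```

Keeping the binding in the type annotation means the row class stays the single source of truth for both the Python field type and the Notion property it maps to.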

Deferred to follow-up PRs

  • RelationProp / PeopleProp / FilesProp — these prop types aren't enumerated in the design doc and relations specifically need cross-target ordering, which is its own design problem. Out of scope here, happy to add RelationProp if you want.

CI status

  • fast-check (ruff format/lint, end-of-file fixers, etc.): pass
  • e2e-type-check (strict mypy, Python 3.11–3.14): pass on all four versions
  • build-test (Rust compile + pytest across Linux/macOS/Windows): in progress — Linux 3.11, Linux 3.14, macOS 3.11 already passing.

Test plan

  • All 12 tests pass locally against a real Notion workspace (pytest python/tests/connectors/test_notion_target.py, ~42s).
  • Connector imports clean (from cocoindex.connectors import notion).
  • Manual end-to-end via the examples/notion_target_basics demo also still works.
  • CI matrix entries for macOS-15-intel, macOS-3.14, Ubuntu-arm, Windows-3.11, Windows-3.14 still pending — those are slower architectures and will signal whether the test pattern is portable.

🤖 Generated with Claude Code

badmonster0 and others added 3 commits May 31, 2026 21:09
Adds cocoindex.connectors.notion — a declarative target connector for
Notion databases (data sources in the 2025-09-03 API), mirroring the
two-level pattern from connectors.postgres / connectors.sqlite.

User declares a Python row class and calls declare_row(); CocoIndex
keeps the Notion data source in sync — creating new pages, patching
changed rows, and archiving pages whose source row falls out of the
declared set. The archive-on-undeclare path is the main thing the
hand-rolled HTTP plumbing in cocoindex-gtm couldn't do.

Phase 1 scope:
- managed_by="user" only — the data source must exist and be shared
  with the integration. Schema is validated against the live data
  source at mount; mismatches fail loudly instead of producing empty
  cells at write time.
- 9 property types: title, rich_text, number, url, email, select,
  multi_select, date, checkbox.
- Query-on-miss page_id resolution (one PK-filter call per cache miss,
  cached for the rest of the run).
- 3 req/s rate limit + Retry-After honored + tenacity retry on 429.
- OnDelete.ARCHIVE (default) / HARD / IGNORE.
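
A rough sketch of the query-on-miss idea from the list above: look up a page_id by primary key only when it is not already cached, and remember the answer for the rest of the run. `PageIdResolver` and `fake_query` are hypothetical names for this illustration, and the single lock is a simplification rather than a claim about how the connector serialises lookups.

```python
import asyncio
from typing import Awaitable, Callable, Optional

QueryFn = Callable[[str], Awaitable[Optional[str]]]


class PageIdResolver:
    """Caches primary key -> page_id and queries only on a cache miss."""

    def __init__(self, query_by_pk: QueryFn) -> None:
        self._query_by_pk = query_by_pk
        self._cache: dict[str, Optional[str]] = {}
        self._lock = asyncio.Lock()

    async def resolve(self, pk: str) -> Optional[str]:
        async with self._lock:
            if pk not in self._cache:
                # None is cached too, so an absent PK is queried only once.
                self._cache[pk] = await self._query_by_pk(pk)
            return self._cache[pk]

    def remember(self, pk: str, page_id: str) -> None:
        """Record the page_id obtained when a new page is created."""
        self._cache[pk] = page_id


calls: list[str] = []


async def fake_query(pk: str) -> Optional[str]:
    # Stand-in for a PK-filtered Notion query; only "alice" already exists.
    calls.append(pk)
    return "page-123" if pk == "alice" else None


async def demo() -> list[Optional[str]]:
    resolver = PageIdResolver(fake_query)
    return [
        await resolver.resolve("alice"),
        await resolver.resolve("alice"),  # served from the cache
        await resolver.resolve("bob"),
    ]


results = asyncio.run(demo())
print(results, calls)
```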

Out of scope (follow-ups):
- managed_by="system" additive mode (auto-create the data source,
  PATCH new properties as the dataclass grows; destructive ops blocked
  unless allow_destructive=True).
- RelationProp / PeopleProp / FilesProp.
- Automated test suite (validated end-to-end by hand against a real
  Notion workspace; CI gating on NOTION_TEST_TOKEN is a follow-up).

Includes:
- python/cocoindex/connectors/notion/ — _client.py, _types.py,
  _target.py, __init__.py
- examples/notion_target_basics/ — minimal demo with README
- docs/src/content/docs/connectors/notion.mdx + sidebar entry
- pyproject.toml: notion optional extra (aiohttp, tenacity)

Design doc:
https://www.notion.so/372daa511a0880fea8c2d9852b1d9f82

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Type all `dict` annotations with `dict[str, Any]` (strict mypy mode
  requires explicit type parameters for generics).
- Add `tenacity` to the mypy missing-imports overrides — tenacity has
  no type stubs, so the `@retry` decorator was tagged as untyped.
- Annotate _provider on DatabaseTarget, fix Any-returning page_id and
  memo_key returns by tagging the local variable type.
- Re-format with ruff to match CI version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

managed_by="system" mode (additive)
-----------------------------------
- New kwargs on mount_database_target / database_target /
  declare_database_target: managed_by, parent_page_id,
  parent_database_id, title, allow_destructive.
- System mode looks under the given parent for a Notion database / data
  source with the matching title; finds-or-creates on first run via
  POST /v1/databases or POST /v1/data_sources, and PATCH-adds new
  properties on subsequent runs when the dataclass grows.
- Destructive changes (existing property's type changed) are rejected
  at mount unless allow_destructive=True. Type signatures are kept tight
  with a new ManagedBy = Literal["user", "system"] alias.
- New per-PropType to_notion_schema() returns the schema body for create
  / PATCH calls. SelectProp / MultiSelectProp gained an optional
  `options=("Foo", "Bar")` field for pre-declared select options.
- New DatabaseSchema methods: to_notion_properties() (the full body for
  create) and diff_against() (additive-vs-destructive split, shared
  between user-mode validation and system-mode evolution).
- _client.py: 3 new methods (create_database, create_data_source,
  update_data_source_properties) and get_database. _request is now a
  manual retry loop (no `@retry` decorator) for clean typing.
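
To make the additive-versus-destructive split concrete, here is a small sketch in the spirit of diff_against(). It is an assumption-laden illustration: `SchemaDiff`, `diff_schema`, and `check_evolution` are hypothetical names, and schemas are reduced to `{property name: type}` dicts rather than real Notion property bodies.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    # Declared properties missing from the live data source: name -> type.
    additive: dict[str, str] = field(default_factory=dict)
    # Properties whose type changed: name -> (live type, declared type).
    destructive: dict[str, tuple[str, str]] = field(default_factory=dict)


def diff_schema(declared: dict[str, str], live: dict[str, str]) -> SchemaDiff:
    """Compare the declared property types against the live ones."""
    diff = SchemaDiff()
    for name, declared_type in declared.items():
        live_type = live.get(name)
        if live_type is None:
            diff.additive[name] = declared_type
        elif live_type != declared_type:
            diff.destructive[name] = (live_type, declared_type)
    return diff


def check_evolution(diff: SchemaDiff, allow_destructive: bool = False) -> None:
    """Reject type changes unless the caller opted in."""
    if diff.destructive and not allow_destructive:
        raise ValueError(f"destructive schema change rejected: {diff.destructive}")


declared = {"Name": "title", "Age": "number", "Email": "email"}
live = {"Name": "title", "Age": "rich_text"}
diff = diff_schema(declared, live)
print(diff)
```

Here `Email` would be a candidate for a PATCH-add in system mode, while the `Age` type change would be refused by `check_evolution(diff)` unless `allow_destructive=True` is passed.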

Test suite
----------
- python/tests/connectors/test_notion_target.py: 8 cases.
- 3 schema-validation tests run without Notion access (typo in
  property_map, two-titles check, managed_by-args validation).
- 5 integration tests gated by NOTION_TEST_TOKEN + NOTION_TEST_PARENT_PAGE:
  insert/update/archive, on_delete=IGNORE behavior, schema-mismatch
  type/missing checks, and system-mode create+evolve. Each integration
  test creates its own temp data source and archives it in teardown.

Notes
-----
- pyproject.toml: notion extra no longer depends on tenacity (replaced
  with an inline manual retry); tenacity removed from `all` and from
  the mypy missing-imports overrides.
- Docs page (docs/src/content/docs/connectors/notion.mdx) updated with
  a system-mode example and the full new mount signature.
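
For context, an inline manual retry of the kind mentioned above might look roughly like the following. This is a stdlib-only sketch under stated assumptions, not the code in _client.py: `request_with_retry`, `backoff_delay`, and the `(status, headers, body)` transport tuple are hypothetical, only HTTP 429 is retried, and the header lookup assumes a plain dict keyed exactly as "Retry-After".

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Mapping, Optional

# (status, headers, json_body) from a hypothetical transport callable.
Response = tuple[int, Mapping[str, str], dict[str, Any]]


def backoff_delay(attempt: int, retry_after: Optional[str],
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Use the server's Retry-After seconds if given, else capped exponential backoff."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; fall through to backoff
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


async def request_with_retry(
    send: Callable[[], Awaitable[Response]],
    max_attempts: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> dict[str, Any]:
    for attempt in range(max_attempts):
        status, headers, body = await send()
        if status == 429 and attempt < max_attempts - 1:
            await sleep(backoff_delay(attempt, headers.get("Retry-After")))
            continue
        if status >= 400:
            raise RuntimeError(f"request failed with HTTP {status}")
        return body
    raise RuntimeError("max_attempts must be at least 1")


# Scripted transport: one rate-limited response, then success.
scripted = iter([(429, {"Retry-After": "2"}, {}), (200, {}, {"ok": True})])
slept: list[float] = []


async def fake_send() -> Response:
    return next(scripted)


async def fake_sleep(seconds: float) -> None:
    slept.append(seconds)  # record the delay instead of actually waiting


result = asyncio.run(request_with_retry(fake_send, sleep=fake_sleep))
print(result, slept)
```

Injecting `sleep` keeps the loop easy to test without real delays, and a plain loop avoids the untyped-decorator issue that motivated dropping tenacity.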

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
badmonster0 requested a review from georgeh0 June 1, 2026 04:31
badmonster0 and others added 10 commits May 31, 2026 21:41
The original lifecycle test used distinct App names per update call, which
gives cocoindex no prior tracking record to reconcile against — so the
archive step silently no-op'd. Reuse the same App name across all steps so
the tracking carries forward.

Same fix for test_on_delete_ignore_leaves_page.

New tests (12 total now, all passing locally with NOTION_TEST_TOKEN +
NOTION_TEST_PARENT_PAGE):

- test_noop_when_no_changes: re-run with identical rows must not touch
  Notion (verified via last_edited_time on each page being unchanged).
- test_on_delete_hard: OnDelete.HARD path actually removes the page from
  active queries.
- test_property_types_roundtrip: title + rich_text + number + url +
  checkbox + select + date all encode -> Notion -> decode without
  corruption.
- test_first_run_against_existing_page: if a page with the declared PK
  already exists in Notion (pre-seeded), the connector PATCHes it
  instead of POSTing a duplicate (exercises the query-on-miss happy
  path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes a gap from the design doc's 14-case test plan: two
mount_database_target calls in one app must sync independently.

Catches the class of bug where per-target state (page_id cache,
asyncio locks, tracking record identity) would accidentally be shared
across targets. Verifies isolation at both insert (each target gets its
own row) and undeclare (dropping rows from one target doesn't affect
the other).

Uses coco.use_mount with explicit component_subpath so each target gets
a stable, distinct subpath across runs — same pattern as
test_sqlite_target.test_multiple_tables.

The other gap from the original 14-case plan, same-PK dedup, is left
out by design: it's a cocoindex framework invariant
(declare_target_state collapses same-StableKey calls before the
connector sees anything), and neither test_sqlite_target.py nor
test_postgres_target.py tests it for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds RelationProp to the supported property type set so users with
linked Notion data sources (e.g. Signals -> Account + Developer in the
cocoindex-gtm CRM) can port to the new target.

Encoding takes a list of page IDs (or a single string); decoding returns
the list back. to_notion_schema() emits the minimal `{"relation": {}}`
when no target_database_id is provided (sufficient for user-managed
mode where the column already exists), or the full single_property body
when target_database_id is given (for managed_by="system" create).
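
The encode/decode behaviour described above can be sketched as follows. `RelationSketch` is a hypothetical stand-in, not the connector's RelationProp. The page-value shape follows the public Notion API's relation format as I understand it; the non-minimal schema body is an assumption and should be checked against the pinned API version.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class RelationSketch:
    """Hypothetical stand-in for RelationProp; not the connector's class."""
    target_database_id: Optional[str] = None

    def encode(self, value: Union[str, list[str], None]) -> dict[str, Any]:
        if value is None:
            ids: list[str] = []
        elif isinstance(value, str):
            ids = [value]  # a single page ID becomes a one-element list
        else:
            ids = list(value)
        return {"relation": [{"id": page_id} for page_id in ids]}

    def decode(self, prop: dict[str, Any]) -> list[str]:
        return [item["id"] for item in prop.get("relation", [])]

    def to_notion_schema(self) -> dict[str, Any]:
        if self.target_database_id is None:
            # Minimal form, enough when the column already exists.
            return {"relation": {}}
        # ASSUMED body shape for creation; verify against the Notion API docs.
        return {
            "relation": {
                "database_id": self.target_database_id,
                "single_property": {},
            }
        }


rel = RelationSketch()
encoded_single = rel.encode("page-a")
round_trip = rel.decode(rel.encode(["page-a", "page-b"]))
print(encoded_single, round_trip, rel.to_notion_schema())
```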

Verified end-to-end via the cocoindex-gtm pipeline port:
- 20 GitHub signals processed; 6 wrote to Notion (the rest skipped
  because the user has no resolved company — existing GTM rule).
- Each Notion Signal row has the right ID, Account relation, and
  Developer relation.
- Re-run with one stargazer dropped via GTM_SKIP_USERS=badmonster0:
  cocoindex reports `process_signal: 20 total | 19 reprocessed,
  1 deleted` and the connector archives the orphaned Notion page.
  Notion confirms 5 active rows (was 6) and 0 active badmonster0 rows.

This is the regression test for George's concern from the design doc:
declare a row, stop declaring it, assert it's archived. Until now the
hand-rolled notion_client.py could only upsert; the new target makes
the delete path automatic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/cocoindex/connectors/notion/_target.py Outdated
badmonster0 and others added 4 commits June 1, 2026 14:26
Match every other target connector (postgres, sqlite, qdrant, lancedb,
…), which default managed_by to "system". The Notion connector was the
lone outlier at "user".

Make managed_by="user" explicit in the user-mode example, docs snippets,
and tests that pass data_source_id positionally, since those now require
user mode to be requested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Reorder ManagedBy Literal to ["system", "user"] (default first), matching
  the other connector docs/types.
- Drop stale "in the follow-up" framing for system mode — it's the default
  and implemented now.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Restructure the basics example and docs walkthrough to teach the
managed_by="system" path first (CocoIndex creates + evolves the "People"
database under a parent page), with managed_by="user" demoted to a variant.
Setup now uses NOTION_PARENT_PAGE instead of NOTION_DATA_SOURCE_ID.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>