fix(api): add cron job to clean up UserSession by NirajNair · Pull Request #4105 · hatchet-dev/hatchet

NirajNair · 2026-06-06T17:05:56Z

Description

Introduces cron scheduler in the API layer and a cron job that runs every hour deleting expired and orphaned UserSession rows addressing the unbounded table growth issue reported in #3913.

Based on the ~1 row/sec insertion rate reported in the issue, approx. 3600 rows accumulate per hour - hence choosing 1-hour cron interval that balances timely cleanup.

As per this comment, UserSession rows are deleted when:

expiresAt < NOW() OR
userId = NULL AND createdAt < NOW() - 24 hours

Follow up fix for issue #3913

Type of change

Bug fix (non-breaking change which fixes an issue)

What's Changed

Added a cron scheduler and a job in API layer that runs every 1 hour to delete expired and orphaned UserSession

Checklist

Tested (unit, integration, or manually with steps specified)
Linted and formatted
Documented (where applicable)
Added to CHANGELOG (where applicable) -- see Keep a Changelog

🤖 AI Disclosure

I acknowledge that an LLM was used in the creation of this Pull Request, in accordance with Hatchet's AI_POLICY.md.
Details: I've used Pi coding agent (model: Minimax M2.7) to understand repo structure, code patterns, compare and align implementation with existing cron job patterns, verify naming conventions, and generate tests.

vercel · 2026-06-06T17:06:01Z

@NirajNair is attempting to deploy a commit to the Hatchet Team on Vercel.

A member of the Team first needs to authorize it.

gregfurman · 2026-06-08T14:58:01Z

+// TestCleanupUserSessions verifies the cleanup logic handles both conditions:
+// 1. Expired sessions (expiresAt < NOW())
+// 2. Unauthenticated sessions (userId IS NULL) older than 24 hours


Can we remove all of this inline documentation?

gregfurman · 2026-06-08T15:10:38Z

Q: I'm wondering if this implementation should instead be in internal/services/controllers/retention instead of at the api layer -- especially since this would allow cleanups to respect a users data retention settings. I'm not sold on either approach so would be curious to hear your thoughts!

cc @abelanger5

I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair

I agree because I think internal/services/controllers/retention already has a cron scheduler and the tasks that it runs looks like maintenance tasks, similar to this one. Also, we won't be introducing a new scheduler in the API layer at all if we move.

gregfurman · 2026-06-08T15:15:27Z

 	Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
 	GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
+
+	CleanupUserSessions(ctx context.Context) (int, error)


Unsure we really need the no. of sessions cleaned up as a return type. Is this just for logging?

Yes, its just for logging.

If it's just for logging then I think it's safe to just replace it with a single error type.

gregfurman · 2026-06-08T15:27:22Z

+		if err != nil {
+			return totalDeleted, err
+		}


If no sessions have expired (and no data is found), we're going to be returning a false positive pgx.ErrNoRows type -- which would be logged as an error etc.

Think we should be guarding against this with a errors.Is(err, pgx.ErrNoRows) check that returns a nil error if true.

gregfurman · 2026-06-08T15:35:56Z

+}
+
+func (c *CronScheduler) Start() (func() error, error) {
+	ctx, cancel := context.WithCancel(context.Background())


nit: we could have a var err error defined with a defer e.g.

var err error defer func() { if err != nil { cancel() } }()

That way, we automatically cancel the operation if err an error is returned.

gregfurman · 2026-06-08T15:37:20Z

+}
+
+func (c *CronScheduler) Start() (func() error, error) {
+	ctx, cancel := context.WithCancel(context.Background())


super-nit: Although we don't do this anywhere else, could be a nice opportunity to use a context.WithCancelCause function and pass the err as the reason 👀

gregfurman · 2026-06-08T15:38:03Z

+// TestCleanupUserSessions_MultiBatch verifies cleanup works correctly when
+// there are more rows than the batch size (1000), forcing multiple batches.


Also these comments. They really just pollute the diff

gregfurman · 2026-06-08T15:52:56Z

+   LIMIT @batchSize::int
+)
+DELETE FROM "UserSession"
+WHERE "id" IN (SELECT "id" FROM sessions_to_delete);


Why can we not just do a DELETE FROM? Why do we need to collect the sessions beforehand?

Cleanup cron deletes rows in batches of 1000 and for that we need LIMIT clause but DELETE (similar to UPDATE) does not support LIMIT clause (PostgresSQL Docs). I believe ORDER BY is not supported either as I couldn't find it in the synopsis of both operations.

PostgresSQL docs recommend to use CTE for batch deletes/updates. Also, this query pattern already exists in the repo, for eg. in CleanupV1QueueItem query.

One improvement that we can do:
change:
DELETE FROM "UserSession" WHERE "id" IN (SELECT "id" FROM sessions_to_delete);

to:
DELETE FROM "UserSession" AS us USING sessions_to_delete AS s WHERE us."id" = s."id";

Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.

Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.

From the original issue:

In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.

SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL) FROM "UserSession"; -- 2,604,487 anonymous | 9 with userId (4 real users)

Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.

If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.

Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.

Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.

@abelanger5 would love your input here!

OK chatted internally. Think this approach as-is is fine. However, we should add something to the release notes/changelog (in CHANGELOG.md) about the drop invalid approach since there are currently users with a large backlog of undeleted rows.

Do you have capacity to add a section to the CHANGELOG.md on this? I'm thinking something like:

## [Unreleased] ### Upgrade Notes * This update includes an additional cleanup job for the `UserSessions` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup: ```sql -- creates a tmp_UserSessions from only valid UserSession rows, -- drops the whole UserSessions table, renames the tmp_UserSessions -- to UserSessions. ```

(Unreleased may not exist... if it doesn't we should add it)

Sure, will update CHANGELOG.md.

Quick Q: should we also have an Added section under Unreleased?

gregfurman · 2026-06-09T09:26:14Z

 	Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
 	GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
+
+	CleanupUserSessions(ctx context.Context) (int, error)


If it's just for logging then I think it's safe to just replace it with a single error type.

gregfurman · 2026-06-09T12:37:17Z

+          AND "createdAt" < NOW() - INTERVAL '24 hours'
+      )
+   ORDER BY "createdAt" ASC
+   LIMIT @batchSize::int


Q: Wonder if we should be using a FOR UPDATE SKIP LOCKED here to prevent data races 🤔 Or is it fine since we don't expect to be concurrently writing to the same rows only reading?

I don't think it will cause any data races but using FOR UPDATE SKIP LOCKED will make the job efficient by letting other API instances SELECT rows that are not already locked by some other instance.

gregfurman · 2026-06-09T12:38:35Z

+   LIMIT @batchSize::int
+)
+DELETE FROM "UserSession"
+WHERE "id" IN (SELECT "id" FROM sessions_to_delete);


Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.

Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.

From the original issue:

In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.

SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL) FROM "UserSession"; -- 2,604,487 anonymous | 9 with userId (4 real users)

Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.

If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.

Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.

Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.

@abelanger5 would love your input here!

gregfurman

Hey @NirajNair Have left 2 more comments. I think this should move to the internal/services/controllers/retention since that means we can control it via the user-configured data retention settings. Also, we should definitely be adding something to the CHANGELOG.md on this.

Looking forward to getting this merged 🙌

gregfurman · 2026-06-12T11:13:31Z

I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair

gregfurman

This is looking great! Left some non-blocking comments. Only actionable thing I really care about from your side is removing that Added section in the CHANGELOG.md (see my comment).

Otherwise, just need an extra pair of eyes on the advertised SQL query and then this looks good to 🚢

gregfurman · 2026-06-13T11:37:28Z


-	ctx, cancel := context.WithCancel(context.Background())
+	var err error
+	ctx, cancel := context.WithCancelCause(context.Background())


praise: Beautiful 😎

gregfurman · 2026-06-13T11:45:18Z

+### Added
+-  Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105)
+


Suggested change

### Added

- Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105)

Sorry the changelog stuff is quite cryptic. This section will automatically be appended on when we generate the release notes (see https://github.com/hatchet-dev/hatchet/releases/tag/v0.89.0). I'm more concerned with the human-written prose beforehand (which you added above 🙏).

I'm sorry I wasn't aware of that. Will remove it.

gregfurman · 2026-06-13T11:46:57Z

+
+* This update includes an additional cleanup job for the `UserSession` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup:
+
+```sql


@mrkaye97 Can you give this a look-over? Seems fine to me but would like to make sure!

NirajNair · 2026-06-13T12:25:26Z

I've removed Added from CHANGELOG.md and created a new variable userSessionInterval in controller.go as suggested.

github-actions Bot added the engine Related to the core Hatchet engine label Jun 6, 2026

NirajNair force-pushed the fix/cleanup-user-sessions branch from 4f80b91 to 44f19ad Compare June 8, 2026 09:08

gregfurman self-requested a review June 8, 2026 14:45

gregfurman requested changes Jun 8, 2026

View reviewed changes

gregfurman requested changes Jun 9, 2026

View reviewed changes

gregfurman requested changes Jun 12, 2026

View reviewed changes

NirajNair force-pushed the fix/cleanup-user-sessions branch from 44f19ad to cb1b19d Compare June 13, 2026 11:34

gregfurman reviewed Jun 13, 2026

View reviewed changes

fix(engine): add scheduled user session cleanup via retention controller

8f6584a

NirajNair force-pushed the fix/cleanup-user-sessions branch from cb1b19d to 8f6584a Compare June 13, 2026 12:07

		// TestCleanupUserSessions_MultiBatch verifies cleanup works correctly when
		// there are more rows than the batch size (1000), forcing multiple batches.

		### Added
		- Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105)


		* This update includes an additional cleanup job for the `UserSession` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup:

		```sql

Conversation

NirajNair commented Jun 6, 2026

Description

Type of change

What's Changed

Checklist

Uh oh!

vercel Bot commented Jun 6, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NirajNair Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gregfurman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gregfurman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NirajNair Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NirajNair commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

NirajNair Jun 12, 2026 •

edited

Loading

NirajNair Jun 13, 2026 •

edited

Loading