Skip to content

fix(api): add cron job to clean up UserSession#4105

Open
NirajNair wants to merge 1 commit into
hatchet-dev:mainfrom
NirajNair:fix/cleanup-user-sessions
Open

fix(api): add cron job to clean up UserSession#4105
NirajNair wants to merge 1 commit into
hatchet-dev:mainfrom
NirajNair:fix/cleanup-user-sessions

Conversation

@NirajNair

Copy link
Copy Markdown

Description

Introduces cron scheduler in the API layer and a cron job that runs every hour deleting expired and orphaned UserSession rows addressing the unbounded table growth issue reported in #3913.

Based on the ~1 row/sec insertion rate reported in the issue, approx. 3600 rows accumulate per hour - hence choosing 1-hour cron interval that balances timely cleanup.

As per this comment, UserSession rows are deleted when:

  • expiresAt < NOW() OR
  • userId = NULL AND createdAt < NOW() - 24 hours

Follow up fix for issue #3913

Type of change

  • Bug fix (non-breaking change which fixes an issue)

What's Changed

  • Added a cron scheduler and a job in API layer that runs every 1 hour to delete expired and orphaned UserSession

Checklist

  • Tested (unit, integration, or manually with steps specified)
  • Linted and formatted
  • Documented (where applicable)
  • Added to CHANGELOG (where applicable) -- see Keep a Changelog

🤖 AI Disclosure
  • I acknowledge that an LLM was used in the creation of this Pull Request, in accordance with Hatchet's AI_POLICY.md.

  • Details: I've used Pi coding agent (model: Minimax M2.7) to understand repo structure, code patterns, compare and align implementation with existing cron job patterns, verify naming conventions, and generate tests.

@vercel

vercel Bot commented Jun 6, 2026

Copy link
Copy Markdown

@NirajNair is attempting to deploy a commit to the Hatchet Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added the engine Related to the core Hatchet engine label Jun 6, 2026
@NirajNair NirajNair force-pushed the fix/cleanup-user-sessions branch from 4f80b91 to 44f19ad Compare June 8, 2026 09:08
@gregfurman gregfurman self-requested a review June 8, 2026 14:45
Comment thread pkg/repository/user_session_test.go Outdated
Comment on lines +60 to +62
// TestCleanupUserSessions verifies the cleanup logic handles both conditions:
// 1. Expired sessions (expiresAt < NOW())
// 2. Unauthenticated sessions (userId IS NULL) older than 24 hours

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove all of this inline documentation?

Comment thread pkg/repository/user_session_test.go
Comment thread api/v1/server/cron/cleanup.go Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: I'm wondering if this implementation should instead be in internal/services/controllers/retention instead of at the api layer -- especially since this would allow cleanups to respect a users data retention settings. I'm not sold on either approach so would be curious to hear your thoughts!

cc @abelanger5

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair

@NirajNair NirajNair Jun 12, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree because I think internal/services/controllers/retention already has a cron scheduler and the tasks that it runs looks like maintenance tasks, similar to this one. Also, we won't be introducing a new scheduler in the API layer at all if we move.

Comment thread pkg/repository/user_session.go Outdated
Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)

CleanupUserSessions(ctx context.Context) (int, error)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure we really need the no. of sessions cleaned up as a return type. Is this just for logging?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its just for logging.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's just for logging then I think it's safe to just replace it with a single error type.

Comment on lines +177 to +179
if err != nil {
return totalDeleted, err
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no sessions have expired (and no data is found), we're going to be returning a false positive pgx.ErrNoRows type -- which would be logged as an error etc.

Think we should be guarding against this with a errors.Is(err, pgx.ErrNoRows) check that returns a nil error if true.

Comment thread api/v1/server/cron/scheduler.go Outdated
}

func (c *CronScheduler) Start() (func() error, error) {
ctx, cancel := context.WithCancel(context.Background())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could have a var err error defined with a defer e.g.

var err error

defer func() {
   if err != nil {
      cancel()
   }
}()

That way, we automatically cancel the operation if err an error is returned.

Comment thread api/v1/server/cron/scheduler.go Outdated
}

func (c *CronScheduler) Start() (func() error, error) {
ctx, cancel := context.WithCancel(context.Background())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: Although we don't do this anywhere else, could be a nice opportunity to use a context.WithCancelCause function and pass the err as the reason 👀

Comment thread pkg/repository/user_session_test.go Outdated
Comment on lines +179 to +180
// TestCleanupUserSessions_MultiBatch verifies cleanup works correctly when
// there are more rows than the batch size (1000), forcing multiple batches.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also these comments. They really just pollute the diff

Comment thread pkg/repository/sqlcv1/users.sql Outdated
LIMIT @batchSize::int
)
DELETE FROM "UserSession"
WHERE "id" IN (SELECT "id" FROM sessions_to_delete);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we not just do a DELETE FROM? Why do we need to collect the sessions beforehand?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup cron deletes rows in batches of 1000 and for that we need LIMIT clause but DELETE (similar to UPDATE) does not support LIMIT clause (PostgresSQL Docs). I believe ORDER BY is not supported either as I couldn't find it in the synopsis of both operations.

PostgresSQL docs recommend to use CTE for batch deletes/updates. Also, this query pattern already exists in the repo, for eg. in CleanupV1QueueItem query.

One improvement that we can do:
change:
DELETE FROM "UserSession" WHERE "id" IN (SELECT "id" FROM sessions_to_delete);

to:
DELETE FROM "UserSession" AS us USING sessions_to_delete AS s WHERE us."id" = s."id";

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.

Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.

From the original issue:

In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.

SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL)
FROM "UserSession";
-- 2,604,487 anonymous   |   9 with userId (4 real users)

Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.

If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.

Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.

Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.

@abelanger5 would love your input here!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK chatted internally. Think this approach as-is is fine. However, we should add something to the release notes/changelog (in CHANGELOG.md) about the drop invalid approach since there are currently users with a large backlog of undeleted rows.

Do you have capacity to add a section to the CHANGELOG.md on this? I'm thinking something like:

## [Unreleased]

### Upgrade Notes

* This update includes an additional cleanup job for the `UserSessions` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup:

```sql
-- creates a tmp_UserSessions from only valid UserSession rows, 
-- drops the whole UserSessions table, renames the tmp_UserSessions
-- to UserSessions.
```

(Unreleased may not exist... if it doesn't we should add it)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will update CHANGELOG.md.

Quick Q: should we also have an Added section under Unreleased?

Comment thread pkg/repository/sqlcv1/users.sql Outdated
Comment thread pkg/repository/user_session.go Outdated
Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)
GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error)

CleanupUserSessions(ctx context.Context) (int, error)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's just for logging then I think it's safe to just replace it with a single error type.

Comment thread pkg/repository/sqlcv1/users.sql Outdated
Comment thread pkg/repository/sqlcv1/users.sql Outdated
AND "createdAt" < NOW() - INTERVAL '24 hours'
)
ORDER BY "createdAt" ASC
LIMIT @batchSize::int

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Wonder if we should be using a FOR UPDATE SKIP LOCKED here to prevent data races 🤔 Or is it fine since we don't expect to be concurrently writing to the same rows only reading?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will cause any data races but using FOR UPDATE SKIP LOCKED will make the job efficient by letting other API instances SELECT rows that are not already locked by some other instance.

Comment thread pkg/repository/sqlcv1/users.sql Outdated
LIMIT @batchSize::int
)
DELETE FROM "UserSession"
WHERE "id" IN (SELECT "id" FROM sessions_to_delete);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.

Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.

From the original issue:

In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.

SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL)
FROM "UserSession";
-- 2,604,487 anonymous   |   9 with userId (4 real users)

Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.

If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.

Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.

Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.

@abelanger5 would love your input here!

@gregfurman gregfurman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @NirajNair Have left 2 more comments. I think this should move to the internal/services/controllers/retention since that means we can control it via the user-configured data retention settings. Also, we should definitely be adding something to the CHANGELOG.md on this.

Looking forward to getting this merged 🙌

Comment thread api/v1/server/cron/cleanup.go Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair

@NirajNair NirajNair force-pushed the fix/cleanup-user-sessions branch from 44f19ad to cb1b19d Compare June 13, 2026 11:34

@gregfurman gregfurman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! Left some non-blocking comments. Only actionable thing I really care about from your side is removing that Added section in the CHANGELOG.md (see my comment).

Otherwise, just need an extra pair of eyes on the advertised SQL query and then this looks good to 🚢


ctx, cancel := context.WithCancel(context.Background())
var err error
ctx, cancel := context.WithCancelCause(context.Background())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Beautiful 😎

Comment thread internal/services/controllers/retention/controller.go Outdated
Comment thread CHANGELOG.md Outdated
Comment on lines +27 to +29
### Added
- Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Added
- Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105)

Sorry the changelog stuff is quite cryptic. This section will automatically be appended on when we generate the release notes (see https://github.com/hatchet-dev/hatchet/releases/tag/v0.89.0). I'm more concerned with the human-written prose beforehand (which you added above 🙏).

@NirajNair NirajNair Jun 13, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I wasn't aware of that. Will remove it.

Comment thread CHANGELOG.md

* This update includes an additional cleanup job for the `UserSession` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup:

```sql

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrkaye97 Can you give this a look-over? Seems fine to me but would like to make sure!

Comment thread pkg/repository/user_session_test.go
@NirajNair NirajNair force-pushed the fix/cleanup-user-sessions branch from cb1b19d to 8f6584a Compare June 13, 2026 12:07
@NirajNair

Copy link
Copy Markdown
Author

I've removed Added from CHANGELOG.md and created a new variable userSessionInterval in controller.go as suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine Related to the core Hatchet engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants