fix(api): add cron job to clean up UserSession#4105
Conversation
|
@NirajNair is attempting to deploy a commit to the Hatchet Team on Vercel. A member of the Team first needs to authorize it. |
4f80b91 to
44f19ad
Compare
| // TestCleanupUserSessions verifies the cleanup logic handles both conditions: | ||
| // 1. Expired sessions (expiresAt < NOW()) | ||
| // 2. Unauthenticated sessions (userId IS NULL) older than 24 hours |
There was a problem hiding this comment.
Can we remove all of this inline documentation?
There was a problem hiding this comment.
Q: I'm wondering if this implementation should instead be in internal/services/controllers/retention instead of at the api layer -- especially since this would allow cleanups to respect a users data retention settings. I'm not sold on either approach so would be curious to hear your thoughts!
cc @abelanger5
There was a problem hiding this comment.
I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair
There was a problem hiding this comment.
I agree because I think internal/services/controllers/retention already has a cron scheduler and the tasks that it runs looks like maintenance tasks, similar to this one. Also, we won't be introducing a new scheduler in the API layer at all if we move.
| Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error) | ||
| GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error) | ||
|
|
||
| CleanupUserSessions(ctx context.Context) (int, error) |
There was a problem hiding this comment.
Unsure we really need the no. of sessions cleaned up as a return type. Is this just for logging?
There was a problem hiding this comment.
If it's just for logging then I think it's safe to just replace it with a single error type.
| if err != nil { | ||
| return totalDeleted, err | ||
| } |
There was a problem hiding this comment.
If no sessions have expired (and no data is found), we're going to be returning a false positive pgx.ErrNoRows type -- which would be logged as an error etc.
Think we should be guarding against this with a errors.Is(err, pgx.ErrNoRows) check that returns a nil error if true.
| } | ||
|
|
||
| func (c *CronScheduler) Start() (func() error, error) { | ||
| ctx, cancel := context.WithCancel(context.Background()) |
There was a problem hiding this comment.
nit: we could have a var err error defined with a defer e.g.
var err error
defer func() {
if err != nil {
cancel()
}
}()That way, we automatically cancel the operation if err an error is returned.
| } | ||
|
|
||
| func (c *CronScheduler) Start() (func() error, error) { | ||
| ctx, cancel := context.WithCancel(context.Background()) |
There was a problem hiding this comment.
super-nit: Although we don't do this anywhere else, could be a nice opportunity to use a context.WithCancelCause function and pass the err as the reason 👀
| // TestCleanupUserSessions_MultiBatch verifies cleanup works correctly when | ||
| // there are more rows than the batch size (1000), forcing multiple batches. |
There was a problem hiding this comment.
Also these comments. They really just pollute the diff
| LIMIT @batchSize::int | ||
| ) | ||
| DELETE FROM "UserSession" | ||
| WHERE "id" IN (SELECT "id" FROM sessions_to_delete); |
There was a problem hiding this comment.
Why can we not just do a DELETE FROM? Why do we need to collect the sessions beforehand?
There was a problem hiding this comment.
Cleanup cron deletes rows in batches of 1000 and for that we need LIMIT clause but DELETE (similar to UPDATE) does not support LIMIT clause (PostgresSQL Docs). I believe ORDER BY is not supported either as I couldn't find it in the synopsis of both operations.
PostgresSQL docs recommend to use CTE for batch deletes/updates. Also, this query pattern already exists in the repo, for eg. in CleanupV1QueueItem query.
One improvement that we can do:
change:
DELETE FROM "UserSession" WHERE "id" IN (SELECT "id" FROM sessions_to_delete);
to:
DELETE FROM "UserSession" AS us USING sessions_to_delete AS s WHERE us."id" = s."id";
There was a problem hiding this comment.
Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.
Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.
From the original issue:
In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.
SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL) FROM "UserSession"; -- 2,604,487 anonymous | 9 with userId (4 real users)
Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.
If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.
Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.
Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.
@abelanger5 would love your input here!
There was a problem hiding this comment.
OK chatted internally. Think this approach as-is is fine. However, we should add something to the release notes/changelog (in CHANGELOG.md) about the drop invalid approach since there are currently users with a large backlog of undeleted rows.
Do you have capacity to add a section to the CHANGELOG.md on this? I'm thinking something like:
## [Unreleased]
### Upgrade Notes
* This update includes an additional cleanup job for the `UserSessions` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup:
```sql
-- creates a tmp_UserSessions from only valid UserSession rows,
-- drops the whole UserSessions table, renames the tmp_UserSessions
-- to UserSessions.
```
(Unreleased may not exist... if it doesn't we should add it)
There was a problem hiding this comment.
Sure, will update CHANGELOG.md.
Quick Q: should we also have an Added section under Unreleased?
| Delete(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error) | ||
| GetById(ctx context.Context, sessionId uuid.UUID) (*sqlcv1.UserSession, error) | ||
|
|
||
| CleanupUserSessions(ctx context.Context) (int, error) |
There was a problem hiding this comment.
If it's just for logging then I think it's safe to just replace it with a single error type.
| AND "createdAt" < NOW() - INTERVAL '24 hours' | ||
| ) | ||
| ORDER BY "createdAt" ASC | ||
| LIMIT @batchSize::int |
There was a problem hiding this comment.
Q: Wonder if we should be using a FOR UPDATE SKIP LOCKED here to prevent data races 🤔 Or is it fine since we don't expect to be concurrently writing to the same rows only reading?
There was a problem hiding this comment.
I don't think it will cause any data races but using FOR UPDATE SKIP LOCKED will make the job efficient by letting other API instances SELECT rows that are not already locked by some other instance.
| LIMIT @batchSize::int | ||
| ) | ||
| DELETE FROM "UserSession" | ||
| WHERE "id" IN (SELECT "id" FROM sessions_to_delete); |
There was a problem hiding this comment.
Thanks for clarifying! Yeah the logic does make sense. I suppose I'm just wondering if this batching is really solving the problem that we think it is.
Like, my main concern is that we're only ever really expect a very large cleanup call on first run, since users may have accumulated millions of rows. Unless someone is really abusing new session creation, I wouldn't expect anywhere close to this number following the previous fix + introduction of cleanups.
From the original issue:
In our self-hosted Hatchet deployment, the UserSession table in the Hatchet Postgres instance has grown to 2.6M rows (~1 GB) over ~30 days.
SELECT count(*) FILTER (WHERE "userId" IS NULL), count(*) FILTER (WHERE "userId" IS NOT NULL) FROM "UserSession"; -- 2,604,487 anonymous | 9 with userId (4 real users)
Perhaps better put: if someone has accumulated millions of rows (worst case), I'm unsure a batch size of 1000 is going to cut it. Otherwise, I see the average and best cases as being very similar in magnitude --that is, 1000 could be a HUGE overshoot when we only ever expect a couple of dozen rows.
If we assume this is the average case, that'll mean 2600 individual calls being made within a loop that doesn't really have backoff or jittering between each operation.
Like, at this point it's probably cheaper to just re-create the entire table or maybe just accept that the first call is going to be expensive.
Maybe it could be worthwhile to include a DB migration that recreates the table with valid session keys instead of relying on the cron approach exclusively.
@abelanger5 would love your input here!
gregfurman
left a comment
There was a problem hiding this comment.
Hey @NirajNair Have left 2 more comments. I think this should move to the internal/services/controllers/retention since that means we can control it via the user-configured data retention settings. Also, we should definitely be adding something to the CHANGELOG.md on this.
Looking forward to getting this merged 🙌
There was a problem hiding this comment.
I've given this some more thought and I think we should be adding this as a cleanup to the internal/services/controllers/retention instead. Wdyt? @NirajNair
44f19ad to
cb1b19d
Compare
gregfurman
left a comment
There was a problem hiding this comment.
This is looking great! Left some non-blocking comments. Only actionable thing I really care about from your side is removing that Added section in the CHANGELOG.md (see my comment).
Otherwise, just need an extra pair of eyes on the advertised SQL query and then this looks good to 🚢
|
|
||
| ctx, cancel := context.WithCancel(context.Background()) | ||
| var err error | ||
| ctx, cancel := context.WithCancelCause(context.Background()) |
| ### Added | ||
| - Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105) | ||
|
|
There was a problem hiding this comment.
| ### Added | |
| - Scheduled user session cleanup via retention controller by @NirajNair in [#4105](https://github.com/hatchet-dev/hatchet/pull/4105) |
Sorry the changelog stuff is quite cryptic. This section will automatically be appended on when we generate the release notes (see https://github.com/hatchet-dev/hatchet/releases/tag/v0.89.0). I'm more concerned with the human-written prose beforehand (which you added above 🙏).
There was a problem hiding this comment.
I'm sorry I wasn't aware of that. Will remove it.
|
|
||
| * This update includes an additional cleanup job for the `UserSession` table to correctly remove expired/invalid user sessions from the table -- addressing unbounded growth. However, we recommend running the following query against your Postgres instance to ensure a bulk-cleanup: | ||
|
|
||
| ```sql |
There was a problem hiding this comment.
@mrkaye97 Can you give this a look-over? Seems fine to me but would like to make sure!
cb1b19d to
8f6584a
Compare
|
I've removed |
Description
Introduces cron scheduler in the API layer and a cron job that runs every hour deleting expired and orphaned
UserSessionrows addressing the unbounded table growth issue reported in #3913.Based on the ~1 row/sec insertion rate reported in the issue, approx. 3600 rows accumulate per hour - hence choosing 1-hour cron interval that balances timely cleanup.
As per this comment,
UserSessionrows are deleted when:expiresAt < NOW()ORuserId = NULL AND createdAt < NOW() - 24 hoursFollow up fix for issue #3913
Type of change
What's Changed
Checklist
🤖 AI Disclosure
I acknowledge that an LLM was used in the creation of this Pull Request, in accordance with Hatchet's AI_POLICY.md.
Details: I've used Pi coding agent (model: Minimax M2.7) to understand repo structure, code patterns, compare and align implementation with existing cron job patterns, verify naming conventions, and generate tests.