fix(plugins/k8saudit-ovh): add WebSocket reconnect loop and improve resilience#1270
Conversation
|
Welcome @elprofessor-de! It looks like this is your first PR to falcosecurity/plugins 🎉 |
|
Hey @elprofessor-de, thanks for your contribution. I have two preliminary comments:
Thanks! |
7cb31c1 to
9fbd8d0
Compare
9fbd8d0 to
4cddf1f
Compare
|
Done! I've split the commits and added the sign-off. |
irozzo-1A
left a comment
There was a problem hiding this comment.
Thanks for your PR, I suggested a few changes to improve error handling
4cddf1f to
974130f
Compare
|
I guessed you squashed together the commit you spllited 😄 |
974130f to
182f5c7
Compare
|
Sorry about that! |
|
Thanks @elprofessor-de for addressing the review comments, sorry for being picky but there are some formatting changes again in the diff, if you could keep the changes to the bare minimum it would be good for tracking. |
|
Yes... I guess it's IDE setting fault 😅 |
definitely, not @elprofessor-de fault, it's annoying to deal with not well formatted code. We have to enforce formatting check in the CI |
182f5c7 to
3d69ce9
Compare
aeea7e5 to
ac46851
Compare
ac46851 to
e9c2bd6
Compare
|
Applied gofmt — the diff should now be clean. Let me know if anything else needs to be adjusted 🙏 |
| * fix(plugins/k8saudit-ovh): retry dial on initial connection failure instead of exiting | ||
| * fix(plugins/k8saudit-ovh): break inner read loop when ping fails after timeout | ||
| * fix(plugins/k8saudit-ovh): add error handling on json.Unmarshal for log envelope and message payload | ||
| * fix(plugins/k8saudit-ovh): replace magic [12:] slice with dynamic separator-based extraction |
There was a problem hiding this comment.
Thank you for updating also the changelog, but this should be automatically generated by issuing make changelog/k8saudit-ovh on the root folder of the project. Also, since you are updating the changelog, could you please also update the plugin version? The new version should be 0.4.1, since this is a non-breaking fix. You should update it also in plugins/k8saudit-ovh/plugin/main.go
There was a problem hiding this comment.
Good to know about make changelog/k8saudit-ovh! I'll regenerate it properly and bump the version to 0.4.1 in main.go. 🙏
e9c2bd6 to
ffd3c32
Compare
|
Thanks for the PR |
…esilience Signed-off-by: Ilyass KAOUAM <ilyassikai@gmail.com>
…ANGELOG Signed-off-by: Ilyass KAOUAM <ilyassikai@gmail.com>
ffd3c32 to
4bd80c5
Compare
|
Done! Regenerated the CHANGELOG using make changelog/k8saudit-ovh and bumped the version to v0.4.1 in main.go. 🙏 |
scraly
left a comment
There was a problem hiding this comment.
Tested in a MKS cluster in 3AZ Paris, and it's OK 💪
Thanks :)
|
LGTM label has been added. DetailsGit tree hash: b1a3e1849f0965b2fe9f83f521fbe6e9225bd51f |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ekoops, elprofessor-de, irozzo-1A, scraly The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Problem
When the LDP WebSocket connection drops (timeout, network blip, server-side close), the plugin logged "Will try to reconnect after 1s..." but actually exited the goroutine without reconnecting. This caused eventC to be closed permanently, and Falco stopped receiving Kubernetes audit events indefinitely — until the pod was manually restarted.
Changes
Reconnect loop: wrapped the dial + read loop in an outer for loop so the connection is re-established automatically after any failure
Dial retry: on initial connection failure, the plugin now retries instead of pushing the error to eventC and exiting
Ping failure handling: if writing a ping fails after a timeout, the inner loop now breaks to trigger a reconnection instead of continuing on a dead connection
JSON error handling: added proper error checking on both json.Unmarshal calls — malformed messages are now logged and skipped cleanly
Magic number removed: replaced the hardcoded [12:] slice (which assumed a fixed-length _appID) with a strings.SplitN on the "> " separator — prevents potential panics on short strings
Testing
Tested against an OVHcloud MKS cluster with intentional WebSocket disconnections — the plugin reconnects automatically and resumes event ingestion without restarting the Falco pod.