@mergify mergify bot commented Dec 10, 2025

What does this PR do?

At the moment, when a spawned OTEL subprocess fails, it is reported only as exit code 1. This gives no information about what failed and marks the entire Elastic Agent as failed.

This PR changes that behavior by inspecting the subprocess's actual output to determine the issue and reporting a proper component status for the entire configuration. It parses the configuration and builds its own aggregated status for the component graph, matching what the healthcheckv2 extension would return if it could run successfully. It then inspects the error message to see whether the error can be correlated to a specific component in the graph. If it cannot, it falls back to reporting the error on all components.
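As a rough illustration of the correlate-or-fall-back idea described above, here is a minimal Go sketch. It is not the code from this PR: the `Pipeline` type, the `FailedComponents` function, the `pipeline/kind:id` key format, and the assumption that the error message quotes the component ID are all hypothetical and chosen only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pipeline is a hypothetical, already-parsed view of one pipeline from the
// OTEL configuration. The real change parses the actual collector config.
type Pipeline struct {
	Receivers  []string
	Processors []string
	Exporters  []string
}

// component pairs a graph key with the bare component ID used for matching.
type component struct {
	Key string // e.g. "traces/exporter:debug" (illustrative format)
	ID  string // e.g. "debug"
}

// components flattens all pipelines into a sorted list of graph entries.
func components(pipelines map[string]Pipeline) []component {
	var out []component
	for name, p := range pipelines {
		kinds := map[string][]string{
			"receiver":  p.Receivers,
			"processor": p.Processors,
			"exporter":  p.Exporters,
		}
		for kind, ids := range kinds {
			for _, id := range ids {
				out = append(out, component{Key: name + "/" + kind + ":" + id, ID: id})
			}
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

// FailedComponents returns the graph keys that should carry the error.
// Assumption: an error that concerns a component mentions its ID in double
// quotes. If no component matches, every component is returned (fallback).
func FailedComponents(pipelines map[string]Pipeline, errMsg string) []string {
	var matched, every []string
	for _, c := range components(pipelines) {
		every = append(every, c.Key)
		if strings.Contains(errMsg, `"`+c.ID+`"`) {
			matched = append(matched, c.Key)
		}
	}
	if len(matched) == 0 {
		return every
	}
	return matched
}

func main() {
	pipelines := map[string]Pipeline{
		"traces": {Receivers: []string{"otlp"}, Exporters: []string{"debug"}},
	}
	// The error names a component, so only that component is reported.
	fmt.Println(FailedComponents(pipelines, `failed to start "debug" exporter`))
	// Nothing can be correlated, so all components are reported.
	fmt.Println(FailedComponents(pipelines, "exit code 1"))
}
```

Matching on the bare ID means a receiver and an exporter that share an ID would both be flagged; a real implementation would likely also use the component kind when the error message includes it.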

Why is it important?

The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run. It also must not mark the entire Elastic Agent as errored when that happens.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test (covered by unit tests)

Disruptive User Impact

None

How to test this PR locally

Use either an invalid OTEL configuration or one that will fail to start. Observe that, when running `elastic-agent run` with that OTEL configuration in the elastic-agent.yml file (a.k.a. Hybrid Mode), `elastic-agent status --output=full` provides correct state information.

Related issues


This is an automatic backport of pull request #11448 done by [Mergify](https://mergify.com).

…1448)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)
@mergify mergify bot added the backport label Dec 10, 2025
@mergify mergify bot requested a review from a team as a code owner December 10, 2025 13:35
@mergify mergify bot requested review from blakerouse and michel-laterman and removed request for a team December 10, 2025 13:35
@github-actions github-actions bot added the Team:Elastic-Agent-Control-Plane label Dec 10, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine

💚 Build Succeeded

cc @blakerouse

@blakerouse blakerouse merged commit 35090b0 into 9.2 Dec 11, 2025
21 checks passed
@blakerouse blakerouse deleted the mergify/bp/9.2/pr-11448 branch December 11, 2025 13:58