Skip to content

Conversation

@michalpristas
Copy link
Contributor

This pull request introduces support for a new "Fleet override" capability, allowing persisted configuration from a file to override configuration received from Fleet, but only if explicitly enabled by capabilities. The changes include the implementation of the new capability type, its integration into the configuration processing flow, and comprehensive unit tests to ensure correct behavior.

Opted for capabilities rather than FF or agent config option as it is clear this is set on purpose. It is not coming from fleet and cannot override itself.

Fleet override capability implementation and integration:

  • Added a new capability type, fleet_override, to the capabilities system, including parsing from YAML, internal representation, and logic to determine if Fleet override is allowed (AllowFleetOverride).
  • Updated the capabilitiesManager to support the new fleetOverrideCaps and wire it into the capability loading process.

Configuration override logic:

  • Implemented the applyPersistedConfig function, which applies persisted configuration from a file on disk if the Fleet override capability is enabled. This function is now called during configuration processing in the coordinator.
  • Added the necessary import for os to support file operations in the coordinator.

Testing:

  • Added unit tests for applyPersistedConfig to verify correct merging of persisted configuration based on the Fleet override capability, including both enabled and disabled scenarios.

Manual testing:

  • For manual testing build and enroll to fleet to a policy with monitoring disabled.

  • Verify http server is not running with curl http://localhost:6774/liveness -I

  • Update elastic-agent.yml with following content

agent:
  monitoring:
    enabled: true
    use_output: default
    logs: false
    metrics: true
    http:
      port: 6774
  • Create capabilities file like so:
capabilities:
- fleet_override:
  rule: allow
  • Restart agent and perform curl again curl http://localhost:6774/liveness -I

@michalpristas michalpristas self-assigned this Dec 2, 2025
@michalpristas michalpristas requested a review from a team as a code owner December 2, 2025 13:49
@michalpristas michalpristas added the enhancement New feature or request label Dec 2, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks fine to me but I don't like where it's located in the codebase. Ideally, this change should try to make reasoning about config loading easier, not harder.

swiatekm
swiatekm previously approved these changes Dec 3, 2025
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmacknz
Copy link
Member

cmacknz commented Dec 3, 2025

Capabilities is a nice way to control this, good idea 👍

The capabilities.yml file has to be read from the config path, do we do permissions checks on ownership in this directory? This new capability is quite powerful so I think we want to do some extra work to make sure that the file was not created by an arbitrary user. For example if agent is root, the config path and capabilities.yml file need to be owned by root too.

I don't see any on the capabilities.yml file itself:

func LoadFile(capsFile string, log *logger.Logger) (Capabilities, error) {
// load capabilities from file
fd, err := os.Open(capsFile)
if errors.Is(err, fs.ErrNotExist) {
// No file, return an empty capabilities manager
log.Infof("Capabilities file not found in %s", capsFile)
return &capabilitiesManager{}, nil
}
if err != nil {
return nil, err
}
// We successfully opened the file, pass it through to Load
defer fd.Close()
return Load(fd, log)
}

@michalpristas
Copy link
Contributor Author

michalpristas commented Dec 4, 2025

@swiatekm i'm reverting to previous implementation.
with this one it becomes a mess, we have one enriched config (with overrides that comes from fleet) then we have stored one that is applied on start without enrichments, so server first starts with persisted config and as fleet does not send another update we stick with this one until update, then suddenly we have updated values...

whole patchers is a bit cumbersome, and need more thorough refactoring that i don't want to do in this PR.
to make this work i need patchers applied at loadConfig (very first thing agent do), then i need another patchers that works with config.Config, on cfgManager, different patchers may require different things, as my required capabilities, rootChecks and fleetManaged checks, these things are done again later on. so all in all, it was very not nice

@michalpristas
Copy link
Contributor Author

@cmacknz i added check for capabilities permissions

@cmacknz
Copy link
Member

cmacknz commented Dec 4, 2025

Maybe I have code blindness at the end of the day, but I don't see permissions checks, maybe forgotten push?

@michalpristas
Copy link
Contributor Author

michalpristas commented Dec 5, 2025

no code blindness, i lost the work
just a note that due to fleet initiated privileges level change, we fix permissions at the start to proper user, so i need to do the check before this step. This means we could potentially have very tiny window where capabilities is wrong.
are we worried about elastic-agent config permissions not being checked at all?

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main concern is limiting security implications because this new capability is pretty powerful for a Fleet managed agent.

If we can restrict this to only working for containers that would help a lot IMO, so it's scoped to only where we need it, and we don't to worry about arbitrary users on a machine disabling the policy output and similar things via the new override capability.

blakerouse
blakerouse previously approved these changes Dec 10, 2025
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. With the feedback from @cmacknz and the added check for being in a container, I think this is a good change and will help a lot with agentless.

cmacknz
cmacknz previously approved these changes Dec 10, 2025
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we may want to log that we are ignoring the fleet override capability though.

@michalpristas michalpristas dismissed stale reviews from blakerouse and cmacknz via e670978 December 11, 2025 15:40
@elasticmachine
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-skip enhancement New feature or request skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants