Skip to content

Conversation

@rsafonseca
Copy link
Contributor

@rsafonseca rsafonseca commented Jul 25, 2024

What does this PR do?

  • Removes is_leader tag from all metrics: this tag makes no sense to be in the metrics and is causing problems. Only vault leader responds to sys/metrics endpoint, if a standby node is hit, the call is redirected to the leader. It also doesn't make sense to have is_leader as a tag of is_leader metric, as this is redundant
  • Fixes issues asserting if vault has leader or has changed leadership using metrics when vault is using a single address for scraping (e.g. when vault is behind a load balancer and some other situations. is_leader metric is left unchanged to avoid breaking anyone that is using it, but this only works properly in a few scenarios. For all scenarios, a has_leader metric will appropriately report if vault has lost a leader.

Motivation

  • Constant alerts/metric spikes whenever datadog-cluster-agent or datadog-clusterchecks are evicted/restarted, because it often loses track of counters, resetting them instead of sending an incremental monotonic value. This happens due to different timeseries being created, due to is_leader tag being present in all metrics, and no consistency/guarantee for the value returned in a few scenarios, like when vault nodes are behind a load balancer.
  • Inconsistent timing on leader changes (is_leader is set to 0 after a rollout for a random amount of time, depending on how many tries it takes for the leader to get randomly called on the sys/leader endpoint

Additional Notes

N/A

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Changelog entries must be created for modifications to shipped code
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@codecov
Copy link

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 88.59%. Comparing base (8480233) to head (ee021a9).
Report is 261 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
hive ?
hivemq ?
ignite ?
jboss_wildfly ?
kafka ?
presto ?
solr ?
vault 67.57% <50.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

has_leader_tags = list(self._tags)
has_leader_tags.extend(dynamic_tags)
has_leader_tags.append(f'current_leader:{current_leader}')
submission_queue.append(lambda tags: self.gauge('has_leader', int(has_leader), tags=has_leader_tags))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
submission_queue.append(lambda tags: self.gauge('has_leader', int(has_leader), tags=has_leader_tags))
submission_queue.append(lambda tags: self.gauge('vault.has_leader', int(has_leader), tags=has_leader_tags))

Copy link
Contributor Author

@rsafonseca rsafonseca Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steveny91 it is working as expected the way it is in the PR, if you see line 89 on the same file you'll notice that the existing is_leader gauge doesn't need the vault. prefix either and it's being submitted the same way :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh sorry it wasn't this one. But the one in vault.py that needed the namespace.

@steveny91
Copy link
Contributor

@rsafonseca Thanks for the contribution. I left a small correction. I'll test the PR a bit later today as the tests are failing.

has_leader_tags = list(self._tags)
has_leader_tags.extend(dynamic_tags)
has_leader_tags.append('current_leader:{}'.format(current_leader))
submission_queue.append(lambda tags: self.gauge('has_leader', int(has_leader), tags=has_leader_tags))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
submission_queue.append(lambda tags: self.gauge('has_leader', int(has_leader), tags=has_leader_tags))
submission_queue.append(lambda tags: self.gauge('vault.has_leader', int(has_leader), tags=has_leader_tags))

Following line 153

@steveny91
Copy link
Contributor

steveny91 commented Sep 4, 2024

@rsafonseca Thanks again for the contribution. I think barring that small edit, this can be merged and shipped with the next agent. I created a PR to update the tests that can be merged in after this PR gets merged:
#18497

I'm no expert on vault myself so I was hoping you can provide some additional clarity on the points you made:

if a standby node is hit, the call is redirected to the leader.

Are non leader node(s) always on standby? If not, what is the behavior of nodes that aren't the leader but are active? I found this issue: hashicorp/vault#24401, that somewhat mentions this redirect, but it goes on about standby so hoping to get a better answer here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants