Data collection and graph definitions for provisiond performance

Description

Provisiond metrics have been exposed via JMX since NMS-10358, but there are no data collection or graph definitions to make them accessible.
I'd like to be able to collect and graph general performance information like:

  • min/max/mean/median time to provision a node, or all nodes for a location

  • rate at which nodes are provisioned
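
As a rough illustration of the summary values requested above, the following sketch computes min, max, mean, median, and a provisioning rate from a list of per-node durations. The class `ProvisionStats`, its field names, and the millisecond units are assumptions made for this example only; they are not part of Provisiond or its JMX beans.

```java
import java.util.Arrays;

/** Hypothetical helper (not an OpenNMS class): summarizes node provisioning durations. */
public class ProvisionStats {
    public final long min;
    public final long max;
    public final double mean;
    public final double median;
    public final double nodesPerSecond;

    /**
     * @param durationsMs time taken to provision each node, in milliseconds (assumed unit)
     * @param windowMs    length of the observation window, in milliseconds
     */
    public ProvisionStats(long[] durationsMs, long windowMs) {
        if (durationsMs.length == 0 || windowMs <= 0) {
            throw new IllegalArgumentException("need at least one duration and a positive window");
        }
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        min = sorted[0];
        max = sorted[n - 1];
        mean = Arrays.stream(sorted).average().getAsDouble();
        // Median: middle element, or the average of the two middle elements for even n.
        median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        // Rate: nodes provisioned per second over the window.
        nodesPerSecond = n * 1000.0 / windowMs;
    }
}
```

Grouping the input durations by location before building one `ProvisionStats` per group would give the "all nodes for a location" view.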

Acceptance / Success Criteria

None

Activity

Alberto January 24, 2023 at 2:47 PM

Merged to foundation-2023

Jeff Gehlbach January 10, 2023 at 7:41 PM

Pushing Horizon fix version from 31.0.3 to 31.0.4 since there still appears to be live discussion about this one.

Will Keaney January 5, 2023 at 6:15 PM

You’ve described exactly what I’m looking for. I want a way to view overall provisiond performance over time, and to drill down into specific jobs when a particular location’s import is behaving unexpectedly.

Dino Yancey January 5, 2023 at 5:43 PM

I think the original (pre-rewrite) intent of the TimeTrackingMonitor was to track and log metrics per provisiond “import job”, so in that respect the current TimeTrackingMonitor is fine; it’s just not useful beyond single-job scale, nor easily collectable by the JMXCollector. If it is possible to extend it with a per-requisition histogram of the time to import a requisition (a sum of all lifecycle stops along the way), independent of the “single import job” lifecycle, so we can do something like:

<mbean name="org.opennms.netmgt.provision.requisitions" resource-type="reqMetrics" objectname="org.opennms.netmgt.provision.requisitions:name=*">
  <attrib name="50thPercentile" alias="ReqPerf50" type="gauge"/>
  <attrib name="75thPercentile" alias="ReqPerf75" type="gauge"/>
  <attrib name="95thPercentile" alias="ReqPerf95" type="gauge"/>
  <attrib name="98thPercentile" alias="ReqPerf98" type="gauge"/>
  <attrib name="99thPercentile" alias="ReqPerf99" type="gauge"/>
  <attrib name="999thPercentile" alias="ReqPerf999" type="gauge"/>
  <attrib name="Max" alias="ReqPerfMax" type="gauge"/>
  <attrib name="Min" alias="ReqPerfMin" type="gauge"/>
  <attrib name="Count" alias="ReqPerfCounter" type="counter"/>
</mbean>
<resourceType name="reqMetrics" label="Requisition Metrics" resourceLabel="${index}">
  <persistenceSelectorStrategy class="org.opennms.netmgt.collection.support.PersistAllSelectorStrategy"/>
  <storageStrategy class="org.opennms.netmgt.dao.support.SiblingColumnStorageStrategy">
    <parameter key="sibling-column-name" value="Name"/>
  </storageStrategy>
</resourceType>

…and end up with an index of requisitions by name, with metrics for each that track the total time, start to finish, to import, load, schedule, scan, audit, relate, event, and persist each requisition, I think that will make everybody happy. In a nutshell, keep the Provisiond metrics you already added so users can see how the daemon as a whole is performing, but also offer per-requisition granularity for import lifecycle duration so users can see where provisiond is spending its time.
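
To make the per-requisition idea concrete, here is a minimal sketch using only the JDK's `javax.management` API. It sums the lifecycle phase durations of one import and records the total under an MBean named after the requisition, which is the shape the `name=*` object name pattern would match. Everything in it is an assumption for illustration: the class `RequisitionTimings`, its methods, and the reuse of the proposed `org.opennms.netmgt.provision.requisitions` domain are hypothetical, and it exposes only Count, Min, and Max rather than the percentile attributes.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical sketch (not OpenNMS code): one JMX bean per requisition name. */
public class RequisitionTimings {

    /** Getters become the JMX attributes "Count", "Min" and "Max". */
    public interface RequisitionMetricsMBean {
        long getCount();
        long getMin();
        long getMax();
    }

    public static class RequisitionMetrics implements RequisitionMetricsMBean {
        private long count;
        private long min = Long.MAX_VALUE;
        private long max = Long.MIN_VALUE;

        synchronized void record(long totalMs) {
            count++;
            min = Math.min(min, totalMs);
            max = Math.max(max, totalMs);
        }

        @Override public synchronized long getCount() { return count; }
        @Override public synchronized long getMin() { return count == 0 ? 0 : min; }
        @Override public synchronized long getMax() { return count == 0 ? 0 : max; }
    }

    // Domain taken from the proposed config in this comment; assumed, not an existing bean.
    private static final String DOMAIN = "org.opennms.netmgt.provision.requisitions";

    private final Map<String, RequisitionMetrics> byName = new ConcurrentHashMap<>();
    private final MBeanServer server = ManagementFactory.getPlatformMBeanServer();

    /** Records one import: the total is the sum of all lifecycle phase durations (ms). */
    public void recordImport(String requisition, long... phaseMs) {
        long total = 0;
        for (long p : phaseMs) {
            total += p;
        }
        byName.computeIfAbsent(requisition, this::register).record(total);
    }

    /** Read-only view of the metrics for a requisition, or null if none were recorded. */
    public RequisitionMetricsMBean get(String requisition) {
        return byName.get(requisition);
    }

    private RequisitionMetrics register(String requisition) {
        RequisitionMetrics metrics = new RequisitionMetrics();
        try {
            // Names containing ',', '=', ':', '"', '*' or '?' would need ObjectName.quote().
            server.registerMBean(metrics, new ObjectName(DOMAIN, "name", requisition));
        } catch (Exception e) {
            // Includes InstanceAlreadyExistsException if the name is already registered.
            throw new IllegalStateException("could not register MBean for " + requisition, e);
        }
        return metrics;
    }
}
```

Because the metrics live outside any single import job, they would not expire with the job and could be polled on a fixed interval.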

What you’ve added in PR#5624 is great and is better aligned with tracking Provisiond-wide performance over time across multiple runs, and I’d really like to see it in a release.

I don’t know if the existing provisiond metrics added in PR#4307 are actually useful. I have no preference on whether they’re refactored out into something more useful, or left in place because they could potentially be useful in some context eventually. Generally, I think more data is better.

Could you also weigh in?

Alberto January 5, 2023 at 4:36 PM

Regarding your last comment about the node provisioning metrics (mentioned in the description): I was trying to add a new set of monitoring stats that would not expire and would not have a timestamp associated.

But now I’m wondering whether you would rather refactor the TimeTrackingMonitor, removing the expiration date and the timestamp for those metrics and using node location as the metric grouping instead.

But then again, they might be useful the way they currently are… I think having both sounds like a reasonable option, but I would like an opinion to confirm.

Fixed

Details

Created November 11, 2022 at 6:47 PM
Updated February 7, 2023 at 2:08 PM
Resolved January 24, 2023 at 2:47 PM