Data collection and graph definitions for provisiond performance
Description
Acceptance / Success Criteria
Activity
Alberto January 24, 2023 at 2:47 PM
Merged to foundation-2023
Jeff Gehlbach January 10, 2023 at 7:41 PM
Pushing Horizon fix version from 31.0.3 to 31.0.4 since there still appears to be live discussion about this one.
Will Keaney January 5, 2023 at 6:15 PM
@Dino Yancey, you’ve described exactly what I’m looking for. I want a way to view overall provisiond performance over time, and to drill down into specific jobs when a particular location’s import is behaving unexpectedly.
Dino Yancey January 5, 2023 at 5:43 PM
I think the original (pre-rewrite) intent of the TimeTrackingMonitor was to track and log metrics per provisiond “import job”, so in that respect the current TimeTrackingMonitor is fine; it’s just not useful beyond single-job scale, nor easily collectable by the JMXCollector. If it is possible to extend it with a per-requisition histogram of the time to import a requisition (a sum of all lifecycle stops along the way), independent of the “single import job” lifecycle, so we can do something like:
<mbean name="org.opennms.netmgt.provision.requisitions" resource-type="reqMetrics" objectname="org.opennms.netmgt.provision.requisitions:name=*">
    <attrib name="50thPercentile" alias="ReqPerf50" type="gauge"/>
    <attrib name="75thPercentile" alias="ReqPerf75" type="gauge"/>
    <attrib name="95thPercentile" alias="ReqPerf95" type="gauge"/>
    <attrib name="98thPercentile" alias="ReqPerf98" type="gauge"/>
    <attrib name="99thPercentile" alias="ReqPerf99" type="gauge"/>
    <attrib name="999thPercentile" alias="ReqPerf999" type="gauge"/>
    <attrib name="Max" alias="ReqPerfMax" type="gauge"/>
    <attrib name="Min" alias="ReqPerfMin" type="gauge"/>
    <attrib name="Count" alias="ReqPerfCounter" type="counter"/>
</mbean>
<resourceType name="reqMetrics" label="Requisition Metrics" resourceLabel="${index}">
    <persistenceSelectorStrategy class="org.opennms.netmgt.collection.support.PersistAllSelectorStrategy"/>
    <storageStrategy class="org.opennms.netmgt.dao.support.SiblingColumnStorageStrategy">
        <parameter key="sibling-column-name" value="Name"/>
    </storageStrategy>
</resourceType>
…and end up with an index of requisitions by name, with metrics for each that track the total time, start to finish, to import, load, schedule, scan, audit, relate, event, and persist each requisition, I think that will make everybody happy. In a nutshell, keep the Provisiond metrics you already added so users can see how the daemon as a whole is performing, but also offer per-requisition granularity for import lifecycle duration so users can see where provisiond is spending its time.
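To make the per-requisition idea concrete, here is a minimal sketch of the bookkeeping it implies: each run's total is the sum of its lifecycle phase durations, and percentiles are computed per requisition name. This is not the TimeTrackingMonitor implementation; the class name, method names, and the nearest-rank percentile method are assumptions for illustration, and only the phase names and attribute names come from the comment above.

```python
import math

# Lifecycle phases as listed in the comment; everything else here is hypothetical.
PHASES = ("import", "load", "schedule", "scan", "audit", "relate", "event", "persist")


class RequisitionDurations:
    """Collects total import durations (milliseconds) keyed by requisition name."""

    def __init__(self):
        self._totals = {}

    def record_run(self, requisition, phase_ms):
        """Store one run; the total is the sum of all reported phase durations."""
        unknown = set(phase_ms) - set(PHASES)
        if unknown:
            raise ValueError(f"unknown phases: {sorted(unknown)}")
        total = sum(phase_ms.values())
        self._totals.setdefault(requisition, []).append(total)
        return total

    def percentile(self, requisition, pct):
        """Nearest-rank percentile of the totals recorded for one requisition."""
        values = sorted(self._totals.get(requisition, []))
        if not values:
            raise KeyError(requisition)
        rank = max(1, math.ceil(pct / 100 * len(values)))
        return values[rank - 1]

    def snapshot(self, requisition):
        """Return a dict whose keys mirror some of the attribute names above."""
        values = self._totals.get(requisition, [])
        if not values:
            raise KeyError(requisition)
        return {
            "Count": len(values),
            "Min": min(values),
            "Max": max(values),
            "50thPercentile": self.percentile(requisition, 50),
            "95thPercentile": self.percentile(requisition, 95),
        }
```

Keeping the totals keyed by requisition name, rather than by import job, is what would allow one set of metrics per requisition across many runs.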
What you’ve added in PR#5624 is great and is better aligned with tracking Provisiond-wide performance over time across multiple runs, and I’d really like to see it in a release.
I don’t know if the existing provisiond metrics added in PR#4307 are actually useful. I have no preference on whether they’re refactored out into something more useful, or left in place because they could potentially be useful in some context eventually. Generally, I think more data is better.
@Will Keaney, could you also weigh in?
Alberto January 5, 2023 at 4:36 PM
@Dino Yancey, regarding your last comment about the node provisioning metrics (mentioned in the description): I was trying to add a new set of monitoring stats that would not expire and would not have an associated timestamp.
But now I’m wondering whether you wanted to refactor the TimeTrackingMonitor to remove the expiration date and the timestamp from those metrics and use node location as the metric grouping instead.
Then again, they might be useful the way they currently are… I think having both sounds like a reasonable option, but I would like an opinion to confirm.
Details
Assignee: Alberto
Reporter: Will Keaney
HB Grooming Date: Nov 21, 2022
HB Backlog Status: Refined Backlog
Story Points: 6
Sprint: None
Priority: Minor

Provisiond metrics have been exposed via JMX since NMS-10358, but there are no datacollection or graph definitions to make them accessible.
I'd like to be able to collect and graph general performance information like:
min/max/mean/median time to provision a node, or all nodes for a location
rate at which nodes are provisioned
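As a rough illustration of the statistics requested above, the following sketch computes min/max/mean/median provisioning time per location and a simple provisioning rate. The function names and the `(location, seconds)` sample shape are assumptions for illustration only; they do not reflect how provisiond exposes its metrics.

```python
from statistics import mean, median


def summarize_by_location(samples):
    """Summarize per-node provisioning times.

    samples: iterable of (location, seconds) pairs, one per provisioned node
    (a hypothetical input shape chosen for this sketch).
    """
    grouped = {}
    for location, seconds in samples:
        grouped.setdefault(location, []).append(seconds)
    return {
        location: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }
        for location, values in grouped.items()
    }


def nodes_per_minute(node_count, window_seconds):
    """Average provisioning rate over an observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return node_count * 60.0 / window_seconds
```

For example, `nodes_per_minute(30, 120)` gives the rate for 30 nodes provisioned within a two-minute window.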