Data collection and graph definitions for provisiond performance

Description

Provisiond metrics have been exposed via JMX since NMS-10358, but there are no data collection or graph definitions to make them accessible.
I'd like to be able to collect and graph general performance information like:

min/max/mean/median time to provision a node, or all nodes for a location
rate at which nodes are provisioned
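
To make the request concrete, here is a minimal sketch of the statistics listed above, computed per location. It is only an illustration: the function names, the location names, and the sample numbers are all hypothetical and are not taken from Provisiond.

```python
from statistics import mean, median


def summarize(durations_ms):
    """Return min/max/mean/median for a non-empty list of durations (ms)."""
    return {
        "min": min(durations_ms),
        "max": max(durations_ms),
        "mean": mean(durations_ms),
        "median": median(durations_ms),
    }


def provisioning_rate(completion_times_s):
    """Nodes per second between the first and last completion timestamps."""
    if len(completion_times_s) < 2:
        return 0.0
    span = max(completion_times_s) - min(completion_times_s)
    return (len(completion_times_s) - 1) / span if span > 0 else 0.0


# Hypothetical per-node provisioning times (ms), grouped by location.
samples = {"Default": [120, 340, 95, 410], "Remote-1": [800, 650]}
summary = {location: summarize(d) for location, d in samples.items()}
```

Grouping by location before summarizing is what gives the "all nodes for a location" view; passing one flat list to the same function would give the daemon-wide view.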

Acceptance / Success Criteria

None

Activity

Alberto January 24, 2023 at 2:47 PM

Merged to foundation-2023

Jeff Gehlbach January 10, 2023 at 7:41 PM

Pushing Horizon fix version from 31.0.3 to 31.0.4 since there still appears to be live discussion about this one.

Will Keaney January 5, 2023 at 6:15 PM

@Dino Yancey, you’ve described exactly what I’m looking for. I want a way to view overall provisiond performance over time, and also to drill down into specific jobs when a particular location’s import is behaving unexpectedly.

Dino Yancey January 5, 2023 at 5:43 PM

I think the original (pre-rewrite) intent of the TimeTrackingMonitor was to track and log metrics per provisiond “import job”, so in that respect the current TimeTrackingMonitor is fine; it’s just not useful beyond single-job scale, nor easily collectable by the JMXCollector. If it is possible to extend it with a per-requisition histogram of the time to import a requisition (a sum of all lifecycle stops along the way), independent of the “single import job” lifecycle, so we can do something like:

    <mbean name="org.opennms.netmgt.provision.requisitions" resource-type="reqMetrics" objectname="org.opennms.netmgt.provision.requisitions:name=*">
       <attrib name="50thPercentile" alias="ReqPerf50" type="gauge"/>
       <attrib name="75thPercentile" alias="ReqPerf75" type="gauge"/>
       <attrib name="95thPercentile" alias="ReqPerf95" type="gauge"/>
       <attrib name="98thPercentile" alias="ReqPerf98" type="gauge"/>
       <attrib name="99thPercentile" alias="ReqPerf99" type="gauge"/>
       <attrib name="999thPercentile" alias="ReqPerf999" type="gauge"/>
       <attrib name="Max" alias="ReqPerfMax" type="gauge"/>
       <attrib name="Min" alias="ReqPerfMin" type="gauge"/>
       <attrib name="Count" alias="ReqPerfCounter" type="counter"/>
    </mbean>
         
    <resourceType name="reqMetrics" label="Requisition Metrics" resourceLabel="${index}">
      <persistenceSelectorStrategy class="org.opennms.netmgt.collection.support.PersistAllSelectorStrategy"/>
      <storageStrategy class="org.opennms.netmgt.dao.support.SiblingColumnStorageStrategy">
        <parameter key="sibling-column-name" value="Name" />
      </storageStrategy>
    </resourceType>

…and end up with an index of requisitions by name, with metrics for each that track the total time, start to finish, to import, load, schedule, scan, audit, relate, event, and persist each requisition, I think that will make everybody happy. In a nutshell, keep the Provisiond metrics you already added so users can see how the daemon as a whole is performing, but also offer per-requisition granularity for import lifecycle duration so users can see where provisiond is spending its time.
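
The per-requisition idea above could be sketched roughly as follows. This is not OpenNMS code: `RequisitionMetrics` and its methods are hypothetical names, the sample durations are made up, and the nearest-rank percentile is a simple stand-in for whatever histogram implementation would actually back the MBean. Only the phase names and the attribute names (`Count`, `Min`, `Max`, `50thPercentile`, `95thPercentile`) come from the comment and the snippet above.

```python
import math

# Lifecycle phases as listed in the comment above.
PHASES = ("import", "load", "schedule", "scan", "audit", "relate", "event", "persist")


class RequisitionMetrics:
    """Keeps total import durations per requisition across many import runs."""

    def __init__(self):
        self._totals = {}  # requisition name -> list of total durations (ms)

    def record(self, requisition, phase_durations_ms):
        # One sample is the sum of all lifecycle phases for a single import run.
        total = sum(phase_durations_ms.get(phase, 0) for phase in PHASES)
        self._totals.setdefault(requisition, []).append(total)

    def percentile(self, requisition, q):
        # Nearest-rank percentile, with q in (0, 100].
        data = sorted(self._totals[requisition])
        rank = max(1, math.ceil(q / 100 * len(data)))
        return data[rank - 1]

    def snapshot(self, requisition):
        # Raises KeyError for a requisition that has never been recorded.
        data = self._totals[requisition]
        return {
            "Count": len(data),
            "Min": min(data),
            "Max": max(data),
            "50thPercentile": self.percentile(requisition, 50),
            "95thPercentile": self.percentile(requisition, 95),
        }


metrics = RequisitionMetrics()
metrics.record("Routers", {"import": 40, "load": 10, "scan": 900, "persist": 50})
metrics.record("Routers", {"import": 60, "load": 20, "scan": 1800, "persist": 120})
metrics.record("Servers", {"import": 30, "scan": 270})
```

Because samples are keyed by requisition name rather than by import job, the numbers survive across runs, which is the property the single-job TimeTrackingMonitor output lacks.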

What you’ve added in PR#5624 is great and is better aligned with tracking Provisiond-wide performance over time across multiple runs, and I’d really like to see it in a release.

I don’t know if the existing provisiond metrics added in PR#4307 are actually useful. I have no preference on whether they’re refactored out into something more useful, or left in place because they could potentially be useful in some context eventually. Generally, I think more data is better.

@Will Keaney, could you also weigh in?

Alberto January 5, 2023 at 4:36 PM

@Dino Yancey, regarding your last comment on the node provisioning metrics (mentioned in the description): I was trying to add a new set of monitoring stats that would not expire and would not have a timestamp associated.

But now I’m wondering whether you wanted to refactor the TimeTrackingMonitor instead, removing the expiration date and the timestamp for those metrics and using node location as the metric grouping?

But then again, they might be useful the way they currently are… Having both sounds like a reasonable option to me, but I would like an opinion to confirm.

Fixed

Details
Assignee
Alberto
Reporter
Will Keaney
HB Grooming Date
Nov 21, 2022
HB Backlog Status
Refined Backlog
Story Points
6
Components
Collectd
Sprint
None
Fix versions
Meridian-2023.1.0
31.0.4
Affects versions
31.0.0
Priority
Minor
Parent
NMS-15016 H32 Supportability Improvements

Created November 11, 2022 at 6:47 PM

Updated February 7, 2023 at 2:08 PM

Resolved January 24, 2023 at 2:47 PM
