Notifications Not Auto-Acking

Description

Some of my services have frequent, brief SNMP outages (down for ~30 seconds), and not all of the resulting notifications are being auto-acked. So, despite a 2m initial delay on the destination path, I'm being notified about outages that have already resolved, and I'm never told that they resolved.

An example of a recent outage's timing:

Lost Service Time	2020-07-08T10:05:36-04:00
Regained Service Time	2020-07-08T10:06:07-04:00

But, the notification related to that (linked to the same outage event):

Notification Time	2020-07-08T10:05:37-04:00	Time Replied	(empty)

Users Notified

Sent To	Sent At	Media	Contact Info
mkelly	2020-07-08T10:07:51-04:00	javaEmail	(empty)

So the notification wasn't acked when service was regained at 10:06:07, and I got an email at 10:07:51, even though the outage had already been resolved.

—

My config for notifications:

Using auto-acknowledge-alarm:

<auto-acknowledge-alarm resolution-prefix="RESOLVED: ">
   <uei>uei.opennms.org/nodes/serviceResponsive</uei>
   <uei>uei.opennms.org/nodes/nodeRegainedService</uei>
   <uei>uei.opennms.org/nodes/interfaceUp</uei>
   <uei>uei.opennms.org/nodes/nodeUp</uei>
   <uei>uei.opennms.org/correlation/remote/wideSpreadOutageResolved</uei>
   <!-- omit a few custom alarms for our environment -->
   <uei>uei.opennms.org/threshold/highThresholdRearmed</uei>
   <uei>uei.opennms.org/threshold/lowThresholdRearmed</uei>
   <uei>uei.opennms.org/internal/importer/importSuccessful</uei>
</auto-acknowledge-alarm>
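
For comparison, notifd-configuration.xml also accepts per-event auto-acknowledge entries, which pair a resolving UEI with the UEI it should acknowledge. The sketch below follows the shape of the stock nodeRegainedService entry as I understand it. The match fields are assumptions to check against your own file, and this is not a confirmed fix for the behaviour described here.

```xml
<!-- Sketch only: pairs nodeRegainedService with nodeLostService.
     The <match> fields are assumed from the stock configuration;
     verify them against your own notifd-configuration.xml. -->
<auto-acknowledge resolution-prefix="RESOLVED: "
                  uei="uei.opennms.org/nodes/nodeRegainedService"
                  acknowledge="uei.opennms.org/nodes/nodeLostService">
   <match>nodeid</match>
   <match>interfaceid</match>
   <match>serviceid</match>
</auto-acknowledge>
```

With an entry like this, the ack is driven by the regained-service event matching the lost-service notification on node, interface, and service, rather than by the alarm being cleared.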

Default queue handler stuff:

<queue>
   <queue-id>default</queue-id>
   <interval>20s</interval>
   <handler-class>
      <name>org.opennms.netmgt.notifd.DefaultQueueHandler</name>
   </handler-class>
</queue>

Destination path:

<path name="Email-Servers" initial-delay="2m">
   <target>
      <name>Servers_OnCall</name>
      <autoNotify>on</autoNotify>
      <command>javaEmail</command>
   </target>
</path>

—

Right now, nothing shows up for this whole month in notifd.log (I assume the default logging level isn't going to show me anything), but I do see this in alarmd for the specific alarm in question:

2020-07-08 10:10:31,613 WARN [alarmd-Thread-4-of-4] o.o.n.a.d.DroolsAlarmContext: Failed to acquire Drools session lock within 20000ms. Add or update for alarm with id=6059035 and reduction-key=uei.opennms.org/nodes/nodeLostService::2751:10.xx.xx.xx:SNMP will not be immediately reflected in the context.
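
On the empty notifd.log: one way to get more detail would be to raise the notifd log level. This is a minimal sketch, assuming the stock etc/log4j2.xml layout where each daemon has a KeyValuePair entry keyed by its log prefix. The surrounding filter element and the default level may differ in your install.

```xml
<!-- Fragment of etc/log4j2.xml (assumed stock layout).
     Change the existing notifd entry rather than adding a second one. -->
<KeyValuePair key="notifd" value="DEBUG" />
```

At DEBUG, notifd.log should show more of the notification and auto-ack processing for the affected notices, which would help confirm whether the ack was attempted at all.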

Environment

RHEL 7, OpenJDK 8, PostgreSQL 9.2

Acceptance / Success Criteria

None

Confluence content

mentioned on

https://confluence.internal.opennms.com/pages/viewpage.action?pageId=43057822

Activity

Mike Kelly January 6, 2021 at 12:36 PM

Updated event config to discard noisy traps.

Mike Kelly January 6, 2021 at 12:35 PM

I think the root cause here was that we had so many incoming traps at one point that OpenNMS was overwhelmed. Judicious use of discard trap event config fixed this for us.
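
As an illustration of the discard approach, here is a sketch of an event definition that drops a matching trap on receipt by setting the logmsg destination to discardtraps. The enterprise OID, specific trap number, UEI, and label are hypothetical placeholders, not values from our environment.

```xml
<!-- Sketch only: all identifying values below are hypothetical. -->
<event>
   <mask>
      <maskelement>
         <mename>id</mename>
         <mevalue>.1.3.6.1.4.1.99999</mevalue> <!-- hypothetical enterprise OID -->
      </maskelement>
      <maskelement>
         <mename>generic</mename>
         <mevalue>6</mevalue> <!-- enterprise-specific trap -->
      </maskelement>
      <maskelement>
         <mename>specific</mename>
         <mevalue>1</mevalue> <!-- hypothetical specific trap number -->
      </maskelement>
   </mask>
   <uei>uei.opennms.org/example/traps/noisyTrap</uei>
   <event-label>Example: noisy trap (discarded)</event-label>
   <descr>Hypothetical noisy trap that is dropped on receipt.</descr>
   <!-- discardtraps tells the trap receiver to drop the trap
        instead of turning it into an event -->
   <logmsg dest="discardtraps">Noisy trap discarded.</logmsg>
   <severity>Normal</severity>
</event>
```

Event definitions are matched in order, so a definition like this needs to be loaded ahead of any more general definition that would otherwise match the same trap.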

Sandy Skipper July 21, 2020 at 3:07 PM

@Mike Kelly Please provide specific steps to reproduce.

Details
Assignee	Unassigned
Reporter	Mike Kelly
HB Grooming Date	Jul 21, 2020
HB Backlog Status	HB
Components	Notifications / Actions
Affects versions	24.1.3
Priority	Minor

Created July 8, 2020 at 2:30 PM

Updated January 6, 2021 at 12:36 PM

Resolved January 6, 2021 at 12:36 PM
