Notifications Not Auto-Acking

Description

Some of my services have frequent, brief SNMP outages (down for ~30 seconds), and not all of the resulting notifications are getting auto-acked. So, despite a 2m initial delay on the destination path, I'm getting notified about issues that are already resolved, and I'm never told that they were resolved.

An example of a recent outage's timing:

Lost Service Time: 2020-07-08T10:05:36-04:00
Regained Service Time: 2020-07-08T10:06:07-04:00

But, the notification related to that (linked to the same outage event):

Notification Time: 2020-07-08T10:05:37-04:00
Time Replied: (blank)

Users Notified:

Sent To | Sent At | Media | Contact Info
mkelly | 2020-07-08T10:07:51-04:00 | javaEmail | (blank)

So, the notification wasn't acked when service was regained at 10:06:07, and I got an email at 10:07:51 even though the outage was already resolved.
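The timing can be checked with a short Python sketch. It uses only the four timestamps quoted above and the 2m delay on the notification path; the variable names are illustrative and nothing here reflects how notifd computes anything internally.

```python
from datetime import datetime, timedelta

# Timestamps copied from the outage and notification details in this report.
lost = datetime.fromisoformat("2020-07-08T10:05:36-04:00")
regained = datetime.fromisoformat("2020-07-08T10:06:07-04:00")
notified = datetime.fromisoformat("2020-07-08T10:05:37-04:00")
sent = datetime.fromisoformat("2020-07-08T10:07:51-04:00")

# The 2m delay configured on the notification path.
initial_delay = timedelta(minutes=2)

outage_duration = regained - lost        # how long the service was down
send_lag = sent - notified               # notification created -> email sent
resolved_before_send = sent - regained   # how long the outage had been over

print(outage_duration)       # 0:00:31
print(send_lag)              # 0:02:14
print(resolved_before_send)  # 0:01:44

# Service came back well inside the 2m delay window, so an auto-ack at that
# point would have been in time to cancel the email.
print(regained < notified + initial_delay)  # True
```

In other words, the service was back 30 seconds after the notification was created, and the email still went out 1 minute 44 seconds after the outage had ended.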

My config for notifications:

  • Using auto-acknowledge-alarm:

    <auto-acknowledge-alarm resolution-prefix="RESOLVED: ">
      <uei>uei.opennms.org/nodes/serviceResponsive</uei>
      <uei>uei.opennms.org/nodes/nodeRegainedService</uei>
      <uei>uei.opennms.org/nodes/interfaceUp</uei>
      <uei>uei.opennms.org/nodes/nodeUp</uei>
      <uei>uei.opennms.org/correlation/remote/wideSpreadOutageResolved</uei>
      <!-- omit a few custom alarms for our environment -->
      <uei>uei.opennms.org/threshold/highThresholdRearmed</uei>
      <uei>uei.opennms.org/threshold/lowThresholdRearmed</uei>
      <uei>uei.opennms.org/internal/importer/importSuccessful</uei>
    </auto-acknowledge-alarm>

  • Default queue handler stuff:

    <queue>
      <queue-id>default</queue-id>
      <interval>20s</interval>
      <handler-class>
        <name>org.opennms.netmgt.notifd.DefaultQueueHandler</name>
      </handler-class>
    </queue>

  • Destination path:

    <path name="Email-Servers" initial-delay="2m">
      <target>
        <name>Servers_OnCall</name>
        <autoNotify>on</autoNotify>
        <command>javaEmail</command>
      </target>
    </path>

 

Right now, nothing shows up for this whole month in notifd.log (I assume the default logging level isn't going to show me anything), but I do see this in alarmd for the specific alarm in question:

2020-07-08 10:10:31,613 WARN [alarmd-Thread-4-of-4] o.o.n.a.d.DroolsAlarmContext: Failed to acquire Drools session lock within 20000ms. Add or update for alarm with id=6059035 and reduction-key=uei.opennms.org/nodes/nodeLostService::2751:10.xx.xx.xx:SNMP will not be immediately reflected in the context.

Environment

RHEL 7, OpenJDK 8, PostgreSQL 9.2

Acceptance / Success Criteria

None

Activity

Mike Kelly January 6, 2021 at 12:36 PM

Updated event config to discard noisy traps.

Mike Kelly January 6, 2021 at 12:35 PM

I think the root cause here was that we had so many incoming traps at one point that OpenNMS was overwhelmed. Judicious use of discard trap event config fixed this for us.

Sandy Skipper July 21, 2020 at 3:07 PM

Please provide specific steps to reproduce.

Created July 8, 2020 at 2:30 PM
Updated January 6, 2021 at 12:36 PM
Resolved January 6, 2021 at 12:36 PM