NMS-12801

Notifications Not Auto-Acking

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Configuration
    • Affects Version/s: 24.1.3
    • Fix Version/s: None
    • Security Level: Default (Default Security Scheme)
    • Labels:
      None
    • Environment:
      RHEL 7, OpenJDK 8, PostgreSQL 9.2
    • HB Backlog Status:
      HB

      Description

      Some of my services have frequent, short SNMP outages (down for ~30 seconds), and not all of the resulting notifications are getting auto-acked. So, despite having a 2m initial delay on the destination path, I'm getting notified about issues that are already resolved, and never being told that they are resolved.

      An example of a recent outage's timing:

      Lost Service Time 2020-07-08T10:05:36-04:00
      Regained Service Time 2020-07-08T10:06:07-04:00

      But, the notification related to that (linked to the same outage event):

      Notification Time 2020-07-08T10:05:37-04:00
      Time Replied (none)

      Users Notified

      Sent To   Sent At                     Media       Contact Info
      mkelly    2020-07-08T10:07:51-04:00   javaEmail

      The notification was not acked when service was regained at 10:06:07, so I got an email at 10:07:51, even though the outage was already resolved.
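
As a rough check on the timeline, the intervals can be worked out from the four timestamps quoted above. This is only a sketch of the arithmetic; the variable names are illustrative and nothing here calls OpenNMS.

```python
from datetime import datetime

# Timestamps copied from the outage and notification records above.
lost = datetime.fromisoformat("2020-07-08T10:05:36-04:00")
regained = datetime.fromisoformat("2020-07-08T10:06:07-04:00")
notified = datetime.fromisoformat("2020-07-08T10:05:37-04:00")
sent = datetime.fromisoformat("2020-07-08T10:07:51-04:00")

# How long the service was down.
outage_seconds = (regained - lost).total_seconds()
# How long after the notification was created the email went out.
sent_after_notice = (sent - notified).total_seconds()
# How long the outage had already been over when the email went out.
sent_after_recovery = (sent - regained).total_seconds()

print(outage_seconds, sent_after_notice, sent_after_recovery)
```

The outage lasted 31 seconds, and the email went out 134 seconds after the notification was created, which is 104 seconds after service had already been regained.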

      My config for notifications:

      • Using auto-acknowledge-alarm:
         <auto-acknowledge-alarm resolution-prefix="RESOLVED: ">
            <uei>uei.opennms.org/nodes/serviceResponsive</uei>
            <uei>uei.opennms.org/nodes/nodeRegainedService</uei>
            <uei>uei.opennms.org/nodes/interfaceUp</uei>
            <uei>uei.opennms.org/nodes/nodeUp</uei>
            <uei>uei.opennms.org/correlation/remote/wideSpreadOutageResolved</uei>
            <!-- omit a few custom alarms for our environment -->
            <uei>uei.opennms.org/threshold/highThresholdRearmed</uei>
            <uei>uei.opennms.org/threshold/lowThresholdRearmed</uei>
            <uei>uei.opennms.org/internal/importer/importSuccessful</uei>
         </auto-acknowledge-alarm>
      
      • Default queue handler stuff:
         <queue>
            <queue-id>default</queue-id>
            <interval>20s</interval>
            <handler-class>
               <name>org.opennms.netmgt.notifd.DefaultQueueHandler</name>
            </handler-class>
         </queue>
      
      • Destination path:
         <path name="Email-Servers" initial-delay="2m">
            <target>
               <name>Servers_OnCall</name>
               <autoNotify>on</autoNotify>
               <command>javaEmail</command>
            </target>
         </path>
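
The observed send time can be compared against the configured values. The sketch below assumes (this is not confirmed by the report) that a delayed notification is sent on the first queue pass after `initial-delay` expires, so within one `interval` of that point. `parse_duration` is a hypothetical helper written for this example, not an OpenNMS API.

```python
from datetime import datetime, timedelta

def parse_duration(text):
    """Parse a simple '<number><unit>' value such as '2m' or '20s' (hypothetical helper)."""
    units = {"s": "seconds", "m": "minutes", "h": "hours"}
    return timedelta(**{units[text[-1]]: int(text[:-1])})

initial_delay = parse_duration("2m")     # initial-delay on the Email-Servers path
queue_interval = parse_duration("20s")   # interval on the default queue

notified = datetime.fromisoformat("2020-07-08T10:05:37-04:00")
regained = datetime.fromisoformat("2020-07-08T10:06:07-04:00")
sent = datetime.fromisoformat("2020-07-08T10:07:51-04:00")

# Assumed send window: delay expiry up to one queue interval later.
earliest = notified + initial_delay
latest = earliest + queue_interval
in_window = earliest <= sent <= latest

# Time available for an auto-ack between recovery and the earliest send.
margin_seconds = (earliest - regained).total_seconds()

print(earliest.isoformat(), latest.isoformat(), in_window, margin_seconds)
```

Under that assumption the email was sent inside the expected window (10:07:37 to 10:07:57), and the service had been back for 90 seconds before the delay expired, so the delay itself was long enough for an auto-ack to have cancelled the email.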
      

       

      Right now, notifd.log has no entries at all for this whole month (I assume the default logging level isn't going to show me anything), but I do see this in the alarmd log for the specific alarm in question:

      2020-07-08 10:10:31,613 WARN [alarmd-Thread-4-of-4] o.o.n.a.d.DroolsAlarmContext: Failed to acquire Drools session lock within 20000ms. Add or update for alarm with id=6059035 and reduction-key=uei.opennms.org/nodes/nodeLostService::2751:10.xx.xx.xx:SNMP will not be immediately reflected in the context.

              People

              Assignee:
              Unassigned
              Reporter:
              pioto Mike Kelly