OpenNMS / NMS-5874

30-second outages caused by ArrayIndexOutOfBoundsException in JNI-Ping implementation


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.11.91
    • Fix Version/s: 1.11.91
    • Security Level: Default (Default Security Scheme)
    • Labels:
      None
    • Environment:

      Description

      After switching from 1.10.9 to the latest SNAPSHOT, I've seen a lot of 30-second outages on nodes, which I didn't have on my old system with the same configuration. The poller.log shows the following error only once:

      2013-05-11 19:44:14,252 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.149
      java.lang.ArrayIndexOutOfBoundsException: 8
      at org.opennms.protocols.icmp.ICMPEchoPacket.storeToBuffer(ICMPEchoPacket.java:419)
      at org.opennms.protocols.icmp.ICMPEchoPacket.toBytes(ICMPEchoPacket.java:432)
      at org.opennms.netmgt.icmp.jni.JniPingRequest.send(JniPingRequest.java:255)
      at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:95)
      at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:46)
      at org.opennms.protocols.rt.RequestTracker.sendRequest(RequestTracker.java:203)
      at org.opennms.protocols.rt.RequestTracker.processNextTimeout(RequestTracker.java:273)
      at org.opennms.protocols.rt.RequestTracker.processTimeouts(RequestTracker.java:249)
      at org.opennms.protocols.rt.RequestTracker.access$3(RequestTracker.java:245)
      at org.opennms.protocols.rt.RequestTracker$2.run(RequestTracker.java:163)

      and this error message many times:

      2013-05-11 19:44:50,255 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.93
      java.lang.ArrayIndexOutOfBoundsException
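
For readers unfamiliar with the message format: on the Java versions current in 2013, the number after `ArrayIndexOutOfBoundsException` in the first trace is the offending index. Index 8 is also the length in bytes of an ICMP echo header (type, code, checksum, identifier, sequence number), so one possible reading is a write just past an 8-byte region. The report does not establish this. The sketch below is a hypothetical illustration, not the OpenNMS `ICMPEchoPacket` code; `EchoHeaderSketch`, `storeUnchecked` and `storeChecked` are invented names.

```java
/** Hypothetical illustration only; this is NOT the OpenNMS ICMPEchoPacket code. */
final class EchoHeaderSketch {
    /** ICMP echo header: type (1), code (1), checksum (2), identifier (2), sequence (2). */
    static final int HEADER_LENGTH = 8;

    /** Writes header plus payload without checking that the buffer is large enough. */
    public static void storeUnchecked(byte[] buf, byte[] payload) {
        int offset = 0;
        buf[offset++] = 8; // type: echo request
        buf[offset++] = 0; // code
        offset += 6;       // checksum, identifier, sequence are left zero in this sketch
        for (byte b : payload) {
            buf[offset++] = b; // fails at index 8 when buf holds only the header
        }
    }

    /** Same write, but rejects an undersized buffer with a descriptive message. */
    public static void storeChecked(byte[] buf, byte[] payload) {
        int needed = HEADER_LENGTH + payload.length;
        if (buf.length < needed) {
            throw new IllegalArgumentException(
                "buffer holds " + buf.length + " bytes, packet needs " + needed);
        }
        storeUnchecked(buf, payload);
    }

    public static void main(String[] args) {
        byte[] headerOnly = new byte[HEADER_LENGTH];
        try {
            storeUnchecked(headerOnly, new byte[] {1, 2, 3});
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message text differs between Java versions, so it is only printed here.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

An explicit length check like the one in `storeChecked` turns a bare index into a message that names both sizes, which is easier to act on when it appears in a poller log.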

      As a side effect, OpenNMS starts to record a lot of 30-second outages for nodes in the network that are actually available.

      To isolate the error, I switched OpenNMS from JNI to JNA and the problem was completely gone. I have a second assumption: if all nodes in the network are available, then even with JNI you will not see this error. If a large number of nodes are unavailable, you can see the problem immediately.

      I raised the number of poller threads to 200 for 130 services and saw the same effect.

      As a workaround I switched over to JNA instead of JNI. I have seen that JNA adds latency to the response time tests, but this is not a big issue in my environment. Availability is more important.
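
The report does not say how the switch to JNA was made. A minimal sketch, assuming the pinger is selected through the `org.opennms.netmgt.icmp.pingerClass` system property in `opennms.properties` (verify the property and class names against your OpenNMS version before relying on them):

```properties
# Assumption: set in $OPENNMS_HOME/etc/opennms.properties, followed by a restart.
# Selects the JNA-based pinger instead of the JNI-based one.
org.opennms.netmgt.icmp.pingerClass=org.opennms.netmgt.icmp.jna.JnaPinger
```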

      As far as I can see, the error is related to a large number of nodes in the network being unavailable.

            People

            • Assignee:
              brozow Matt Brozowski
              Reporter:
              indigo Ronny Trommer