30-second outages caused by ArrayIndexOutOfBoundsException in JNI-Ping implementation

Description

After switching from 1.10.9 to the latest SNAPSHOT, I've seen a lot of 30-second outages from nodes, which I didn't have on my old system with the same configuration. The poller.log shows the following error only once:

2013-05-11 19:44:14,252 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.149
java.lang.ArrayIndexOutOfBoundsException: 8
at org.opennms.protocols.icmp.ICMPEchoPacket.storeToBuffer(ICMPEchoPacket.java:419)
at org.opennms.protocols.icmp.ICMPEchoPacket.toBytes(ICMPEchoPacket.java:432)
at org.opennms.netmgt.icmp.jni.JniPingRequest.send(JniPingRequest.java:255)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:95)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:46)
at org.opennms.protocols.rt.RequestTracker.sendRequest(RequestTracker.java:203)
at org.opennms.protocols.rt.RequestTracker.processNextTimeout(RequestTracker.java:273)
at org.opennms.protocols.rt.RequestTracker.processTimeouts(RequestTracker.java:249)
at org.opennms.protocols.rt.RequestTracker.access$3(RequestTracker.java:245)
at org.opennms.protocols.rt.RequestTracker$2.run(RequestTracker.java:163)

and this error message many times:

2013-05-11 19:44:50,255 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.93
java.lang.ArrayIndexOutOfBoundsException
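
For readers unfamiliar with the exception: the index 8 in the first log entry equals the length of an ICMP echo header (type, code, checksum, identifier, sequence number), so it is the first byte position after the header. The following is a minimal sketch of how writing a payload into a buffer sized for the header alone produces exactly that index. It is not the JICMP source; the class `EchoBufferSketch` and its methods are hypothetical names chosen for illustration, and the actual root cause is the one described in the commit linked in the comments.

```java
class EchoBufferSketch {
    // ICMP echo header: type(1) + code(1) + checksum(2) + identifier(2) + sequence(2)
    static final int HEADER_LEN = 8;

    /** Writes header, then payload, into buf starting at offset; returns the next free offset. */
    static int storeToBuffer(byte[] buf, int offset, int id, int seq, byte[] payload) {
        buf[offset++] = 8;                  // type: echo request
        buf[offset++] = 0;                  // code
        buf[offset++] = 0;                  // checksum placeholder (not computed in this sketch)
        buf[offset++] = 0;
        buf[offset++] = (byte) (id >>> 8);  // identifier, big-endian
        buf[offset++] = (byte) id;
        buf[offset++] = (byte) (seq >>> 8); // sequence number, big-endian
        buf[offset++] = (byte) seq;
        for (byte b : payload) {
            buf[offset++] = b;              // fails at index 8 if buf holds only the header
        }
        return offset;
    }

    /** Hypothetical buggy variant: the buffer ignores the payload length. */
    static byte[] toBytesUndersized(int id, int seq, byte[] payload) {
        byte[] buf = new byte[HEADER_LEN];
        storeToBuffer(buf, 0, id, seq, payload);
        return buf;
    }

    /** Correct variant: the buffer is sized for header plus payload. */
    static byte[] toBytes(int id, int seq, byte[] payload) {
        byte[] buf = new byte[HEADER_LEN + payload.length];
        storeToBuffer(buf, 0, id, seq, payload);
        return buf;
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        System.out.println("packet length: " + toBytes(1, 1, payload).length);
        try {
            toBytesUndersized(1, 1, payload);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The later log entries, which show only the exception class with no index and no stack trace, are consistent with the HotSpot JVM omitting stack traces for exceptions that are thrown repeatedly from compiled code; starting the JVM with `-XX:-OmitStackTraceInFastThrow` keeps the full trace on every occurrence.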

As a side effect, this issue affects nodes that are actually available in the network, and OpenNMS starts to record a lot of 30-second outages.

To isolate the error, I switched OpenNMS from JNI to JNA, and the problem was completely gone. I have a second assumption: if all nodes in my network are available, then even with JNI you will see this error. If you have a large number of nodes which are not available, then you can see this problem immediately.

I raised the number of poller threads to 200 for 130 services and saw the same effect.

As a workaround, I switched over to JNA instead of JNI. I have seen that JNA adds latency to the response time tests, but this is not a big issue in my environment; availability is more important.

As far as I can see, there seems to be a relation between the error and a large number of nodes which are not available in the network.
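
One possible reading of that relation, based on the stack trace in the description: the exception is raised on the `JNI-ICMP-57-Timeout-Processor` thread, inside `RequestTracker.processNextTimeout`, which calls `sendRequest` again. That is the resend path, and it only runs for requests that received no reply in time. The sketch below illustrates this under a loudly stated assumption, namely that the first send succeeds and only a resend fails. `RetryPathSketch`, `PingRequest`, and `pollOnce` are hypothetical names, not OpenNMS classes.

```java
import java.util.ArrayList;
import java.util.List;

class RetryPathSketch {
    /** Hypothetical ping request; not the OpenNMS JniPingRequest. */
    static class PingRequest {
        final String host;
        final boolean hostAnswers;
        int sendCount = 0;

        PingRequest(String host, boolean hostAnswers) {
            this.host = host;
            this.hostAnswers = hostAnswers;
        }

        void send() {
            sendCount++;
            if (sendCount > 1) {
                // Assumption for illustration only: a resend fails, the first send does not.
                throw new ArrayIndexOutOfBoundsException(8);
            }
        }
    }

    /**
     * Sends every request once, then resends the ones that got no reply.
     * Returns the hosts whose resend threw an exception.
     */
    static List<String> pollOnce(List<PingRequest> requests) {
        List<String> failedResends = new ArrayList<>();
        for (PingRequest request : requests) {
            request.send();
            if (request.hostAnswers) {
                continue; // reply received: the timeout path never runs
            }
            try {
                request.send(); // timeout expired: resend
            } catch (ArrayIndexOutOfBoundsException e) {
                failedResends.add(request.host);
            }
        }
        return failedResends;
    }

    public static void main(String[] args) {
        List<PingRequest> requests = new ArrayList<>();
        requests.add(new PingRequest("reachable-host", true));
        requests.add(new PingRequest("unreachable-host", false));
        System.out.println("resend failures: " + pollOnce(requests));
    }
}
```

Under this assumption, a network where every node answers on the first attempt never enters the failing path, while a network with many unreachable nodes enters it on every poll.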

Environment

[root@carla etc]# yum info opennms
Loaded plugins: langpacks, presto, refresh-packagekit
Installed Packages
Name        : opennms
Arch        : noarch
Version     : 1.11.91
Release     : 0.20130510.7
Size        : 0.0
Repo        : installed
From repo   : opennms-snapshot-common
Summary     : Enterprise-grade Network Management Platform (Easy Install)
URL         : http://www.opennms.org/
License     : LGPL/GPL
Description : OpenNMS is an enterprise-grade network management platform.
            :
            : This package used to contain what is now in the "opennms-core" package.
            : It now exists to give a reasonable default installation of OpenNMS.
            :
            : When you install this package, you will likely also need to install the
            : webapp package.
            :
            : This is an OpenNMS build from the (no branch. For a complete log, see:
            : http://opennms.git.sourceforge.net/git/gitweb.cgi?p=opennms/opennms;a=shortlog;h=58ce2a9750a19cd5ca88c5c84a124b8986234c20

Acceptance / Success Criteria

None

Attachments

3
  • 23 May 2013, 10:43 AM
  • 11 May 2013, 04:36 PM
  • 11 May 2013, 04:36 PM

Activity

Benjamin Reed July 29, 2013 at 6:12 PM

And for the native code... JICMP code was broken in 1.3.x and fixed in 1.4.0. JICMP6 was not affected.

Benjamin Reed July 29, 2013 at 6:10 PM

It was only broken for a few 1.11.x releases. It was never in 1.10, and it has been fixed in 1.11 and up since 1.11.91.

David Hustace July 29, 2013 at 5:23 PM

What minimum version is required to get these fixes?

Benjamin Reed May 23, 2013 at 11:03 AM

Matt forgot to close this bug. :)

This was fixed by a combination of the same change you made, and a rather fiddly change to JICMP itself, which then required an update to the POMs to pull it in.

See https://github.com/OpenNMS/jicmp/commit/051e02cb8ee0a474b879c29377b194231b4c224a for details of the changes needed in the JICMP side of things. (JICMP6 was unaffected.)

Simon Walter May 23, 2013 at 10:43 AM

I have created a patch based on Ron's analysis.

Here is the branch:
https://github.com/OpenNMS/opennms/tree/realthargor/NMS-5874

A JAR built from 1.13 with the fix is attached. Can someone test it?

Fixed

Created May 11, 2013 at 1:59 PM
Updated January 27, 2017 at 4:21 PM
Resolved May 23, 2013 at 11:03 AM