30-second outages caused by ArrayIndexOutOfBoundsException in JNI-Ping implementation
Attachments
- 23 May 2013, 10:43 AM
- 11 May 2013, 04:36 PM
- 11 May 2013, 04:36 PM
Activity
Benjamin Reed July 29, 2013 at 6:12 PM
And for the native code... JICMP code was broken in 1.3.x and fixed in 1.4.0. JICMP6 was not affected.
Benjamin Reed July 29, 2013 at 6:10 PM
It was only broken for a few 1.11.x releases. It was never in 1.10, and it has been fixed in 1.11 and up since 1.11.91.
David Hustace July 29, 2013 at 5:23 PM
What minimum version is required to get these fixes?
Benjamin Reed May 23, 2013 at 11:03 AM
Matt forgot to close this bug.
This was fixed by a combination of the same change you made, and a rather fiddly change to JICMP itself, which then required an update to the POMs to pull it in.
See https://github.com/OpenNMS/jicmp/commit/051e02cb8ee0a474b879c29377b194231b4c224a for details of the changes needed in the JICMP side of things. (JICMP6 was unaffected.)
Simon Walter May 23, 2013 at 10:43 AM
I have created a patch based on Ron's analysis.
Here is the branch:
https://github.com/OpenNMS/opennms/tree/realthargor/NMS-5874
A JAR from 1.13 with the fix is attached. Can someone test it?
Details
Assignee: Matt Brozowski
Reporter: Ronny Trommer
Priority: Blocker
Description
After switching from 1.10.9 to the latest SNAPSHOT, I've seen a lot of 30-second outages from nodes, which I didn't have on my old system with the same configuration. The poller.log shows the following error only once:
2013-05-11 19:44:14,252 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.149
java.lang.ArrayIndexOutOfBoundsException: 8
at org.opennms.protocols.icmp.ICMPEchoPacket.storeToBuffer(ICMPEchoPacket.java:419)
at org.opennms.protocols.icmp.ICMPEchoPacket.toBytes(ICMPEchoPacket.java:432)
at org.opennms.netmgt.icmp.jni.JniPingRequest.send(JniPingRequest.java:255)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:95)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:46)
at org.opennms.protocols.rt.RequestTracker.sendRequest(RequestTracker.java:203)
at org.opennms.protocols.rt.RequestTracker.processNextTimeout(RequestTracker.java:273)
at org.opennms.protocols.rt.RequestTracker.processTimeouts(RequestTracker.java:249)
at org.opennms.protocols.rt.RequestTracker.access$3(RequestTracker.java:245)
at org.opennms.protocols.rt.RequestTracker$2.run(RequestTracker.java:163)
and this error message many times:
2013-05-11 19:44:50,255 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.93
java.lang.ArrayIndexOutOfBoundsException
This issue has a side effect on the nodes that are available in the network: OpenNMS starts to record a lot of 30-second outages for them.
To isolate the error, I switched OpenNMS from JNI to JNA and the problem was completely gone. I also have a second assumption: if all nodes in the network are available, you will not see this error even with JNI. If a large number of nodes are unavailable, you see the problem immediately.
I raised the number of poller threads to 200 for 130 services and saw the same effect.
As a workaround I switched over to JNA instead of JNI. I have seen that JNA adds latency to the response time tests, but this is not a big issue in my environment. Availability is more important.
As far as I can see, there seems to be a relation between the error and a large number of unavailable nodes in the network.
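The JNA workaround described above can be sketched as a one-line configuration change. This assumes the pinger implementation is selected through the `org.opennms.netmgt.icmp.pingerClass` system property in `$OPENNMS_HOME/etc/opennms.properties`, and that the JNA pinger class is `org.opennms.netmgt.icmp.jna.JnaPinger`. Neither name appears in this report, so check both against the installed version before relying on them.

```properties
# Assumed location: $OPENNMS_HOME/etc/opennms.properties
# Select the JNA-based pinger instead of the JNI-based one.
# Property name and class name are assumptions, not taken from this report.
org.opennms.netmgt.icmp.pingerClass=org.opennms.netmgt.icmp.jna.JnaPinger
```

OpenNMS most likely needs a restart before the new pinger is used.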