30-second outages caused by ArrayIndexOutOfBoundsException in JNI-Ping implementation

Description

After switching from 1.10.9 to the latest SNAPSHOT, I have seen many 30-second outages on nodes, which I did not have on my old system with the same configuration. The poller.log shows the following error, with a full stack trace, only once:

2013-05-11 19:44:14,252 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.149
java.lang.ArrayIndexOutOfBoundsException: 8
at org.opennms.protocols.icmp.ICMPEchoPacket.storeToBuffer(ICMPEchoPacket.java:419)
at org.opennms.protocols.icmp.ICMPEchoPacket.toBytes(ICMPEchoPacket.java:432)
at org.opennms.netmgt.icmp.jni.JniPingRequest.send(JniPingRequest.java:255)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:95)
at org.opennms.netmgt.icmp.jni.JniIcmpMessenger.sendRequest(JniIcmpMessenger.java:46)
at org.opennms.protocols.rt.RequestTracker.sendRequest(RequestTracker.java:203)
at org.opennms.protocols.rt.RequestTracker.processNextTimeout(RequestTracker.java:273)
at org.opennms.protocols.rt.RequestTracker.processTimeouts(RequestTracker.java:249)
at org.opennms.protocols.rt.RequestTracker.access$3(RequestTracker.java:245)
at org.opennms.protocols.rt.RequestTracker$2.run(RequestTracker.java:163)
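
For readers unfamiliar with this kind of failure: the message "ArrayIndexOutOfBoundsException: 8" means index 8 was accessed on an array of length 8 or less, and 8 bytes happens to be the size of an ICMP echo header (type, code, checksum, identifier, sequence). The following is only a sketch of how such an error can arise when a packet is serialized into a buffer sized for the header alone. EchoPacketSketch and toBytes are hypothetical names for illustration; this is not the OpenNMS ICMPEchoPacket source, and the actual root cause may differ.

```java
// Hypothetical sketch, NOT the OpenNMS ICMPEchoPacket implementation.
class EchoPacketSketch {
    // ICMP echo header: type (1), code (1), checksum (2), identifier (2), sequence (2)
    static final int HEADER_LENGTH = 8;

    /** Writes header and payload into a buffer of the given size without a bounds check. */
    static byte[] toBytes(int identifier, int sequence, byte[] payload, int bufferSize) {
        byte[] buf = new byte[bufferSize];
        buf[0] = 8;                           // type: echo request
        buf[1] = 0;                           // code
        // bytes 2-3: checksum, left as zero in this sketch
        buf[4] = (byte) (identifier >>> 8);
        buf[5] = (byte) identifier;
        buf[6] = (byte) (sequence >>> 8);
        buf[7] = (byte) sequence;
        for (int i = 0; i < payload.length; i++) {
            // With bufferSize == HEADER_LENGTH the first payload byte lands on index 8.
            buf[HEADER_LENGTH + i] = payload[i];
        }
        return buf;
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        System.out.println("ok, length " + toBytes(57, 1, payload, 12).length);
        try {
            toBytes(57, 1, payload, HEADER_LENGTH);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The exact message text depends on the Java version; older JVMs print only the index, as in the log above.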

After that, the same exception appears many times, but without a stack trace:

2013-05-11 19:44:50,255 INFO [JNI-ICMP-57-Timeout-Processor] SinglePingResponseCallback: an error occurred pinging /192.168.30.93
java.lang.ArrayIndexOutOfBoundsException
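
A note on the missing stack trace: on HotSpot JVMs, once the same implicit exception has been thrown many times from compiled code, the JVM may reuse a preallocated exception that carries neither message nor stack trace. This is presumably why only the first occurrence in poller.log is complete. The sketch below, assuming a HotSpot JVM, probes for that behaviour; FastThrowSketch and firstTraceless are hypothetical names, and whether and when the trace disappears depends on the JVM and its JIT decisions.

```java
// Sketch: reports after how many identical failures the JVM stops attaching
// a stack trace. Returns -1 if every exception kept its trace.
class FastThrowSketch {
    static int firstTraceless(int attempts) {
        int[] data = new int[8];
        for (int i = 0; i < attempts; i++) {
            try {
                data[index()] = i;            // always out of bounds
            } catch (ArrayIndexOutOfBoundsException e) {
                if (e.getStackTrace().length == 0) {
                    return i;                 // trace was omitted by the JVM
                }
            }
        }
        return -1;
    }

    // Kept in a separate method so the bad index is not a literal at the access site.
    static int index() {
        return 8;
    }

    public static void main(String[] args) {
        int n = firstTraceless(1_000_000);
        System.out.println(n < 0 ? "stack traces were never omitted"
                                 : "stack trace first omitted at attempt " + n);
    }
}
```

Starting the JVM with -XX:-OmitStackTraceInFastThrow disables this optimization, so every occurrence is logged with its full trace.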

This issue affects the reported availability of nodes in the network: OpenNMS starts to record a lot of 30-second outages.

To isolate the error, I switched OpenNMS to JNA instead of JNI, and the problem was completely gone. I have a second assumption: if all nodes in my network are available, then even with JNI you will see this error. If a large number of nodes are unavailable, you can see the problem immediately.

I also raised the number of poller threads to 200 for 130 services and saw the same effect.

As a workaround, I have switched over to JNA instead of JNI. I have seen that JNA adds latency to the response time tests, but this is not a big issue in my environment. Availability is more important.

As far as I can see, there is a relation between the error and a large number of unavailable nodes in the network.
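
One possible reading of that relation, based only on the stack trace above: the exception is raised from RequestTracker.processNextTimeout, on the JNI-ICMP Timeout-Processor thread, that is, while a timed-out ping is being sent again. The sketch below shows why the number of passes through such a resend path grows with the number of unreachable hosts. RetryPathSketch, resendCount, and the retry count are hypothetical and are not taken from the OpenNMS RequestTracker.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of a resend-on-timeout loop, NOT the OpenNMS RequestTracker.
class RetryPathSketch {
    /** Counts how often the resend path runs for the given hosts. */
    static int resendCount(List<String> hosts, Predicate<String> answers, int retries) {
        int resends = 0;
        for (String host : hosts) {
            if (answers.test(host)) {
                continue;                     // reply arrived in time: no timeout, no resend
            }
            resends += retries;               // every timeout sends the request again
        }
        return resends;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("192.168.30.93", "192.168.30.149", "192.168.30.1");
        // Assumption for the demo: only 192.168.30.1 answers.
        Predicate<String> answers = host -> host.equals("192.168.30.1");
        System.out.println("resends: " + resendCount(hosts, answers, 2));
    }
}
```

In this model, a defect that only shows up when a request is re-sent would surface once per resend, so a network with many unreachable hosts would expose it much sooner.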

Environment

[root@carla etc]# yum info opennms
Loaded plugins: langpacks, presto, refresh-packagekit
Installed Packages
Name        : opennms
Arch        : noarch
Version     : 1.11.91
Release     : 0.20130510.7
Size        : 0.0
Repo        : installed
From repo   : opennms-snapshot-common
Summary     : Enterprise-grade Network Management Platform (Easy Install)
URL         : http://www.opennms.org/
License     : LGPL/GPL
Description : OpenNMS is an enterprise-grade network management platform.
            :
            : This package used to contain what is now in the "opennms-core" package.
            : It now exists to give a reasonable default installation of OpenNMS.
            :
            : When you install this package, you will likely also need to install the
            : webapp package.
            :
            : This is an OpenNMS build from the (no branch. For a complete log, see:
            : http://opennms.git.sourceforge.net/git/gitweb.cgi?p=opennms/opennms;a=shortlog;h=58ce2a9750a19cd5ca88c5c84a124b8986234c20

Acceptance / Success Criteria

None

Attachments

23 May 2013, 10:43 AM
11 May 2013, 04:36 PM
11 May 2013, 04:36 PM

Activity

Benjamin Reed July 29, 2013 at 6:12 PM

And for the native code... JICMP code was broken in 1.3.x and fixed in 1.4.0. JICMP6 was not affected.

Benjamin Reed July 29, 2013 at 6:10 PM

It was only broken for a few 1.11.x releases. It was never in 1.10, and it has been fixed in 1.11 and up since 1.11.91.

David Hustace July 29, 2013 at 5:23 PM

What minimum version is required to get these fixes?

Benjamin Reed May 23, 2013 at 11:03 AM

Matt forgot to close this bug.

This was fixed by a combination of the same change you made, and a rather fiddly change to JICMP itself, which then required an update to the POMs to pull it in.

See https://github.com/OpenNMS/jicmp/commit/051e02cb8ee0a474b879c29377b194231b4c224a for details of the changes needed in the JICMP side of things. (JICMP6 was unaffected.)

Simon Walter May 23, 2013 at 10:43 AM

I have created a patch based on Ron's analysis.

Here is the branch:
https://github.com/OpenNMS/opennms/tree/realthargor/NMS-5874

A JAR built from 1.13 with the fix is attached. Can someone test it?

Fixed

Details
Assignee
Matt Brozowski
Reporter
Ronny Trommer
Components
Polling / Monitors / Outages
Fix versions
1.11.91
Affects versions
1.11.91
Priority
Blocker

Created May 11, 2013 at 1:59 PM

Updated January 27, 2017 at 4:21 PM

Resolved May 23, 2013 at 11:03 AM
