Newts: When Cassandra Cluster is unavailable, OpenNMS gives up on trying to contact it again

Description

Hi!

We recently ran into a situation where both of our Cassandra servers were unavailable to OpenNMS for a couple of minutes. Even though the servers were able to respond again a short while later, OpenNMS did not try to reconnect to them and did not send any collected performance data to Cassandra. The logs revealed an exception that occurred because of the connection loss. The connection was re-established only after restarting OpenNMS.

Please change the behaviour of OpenNMS in case of connection loss to all Cassandra Servers, so it tries to reconnect "some time" after the incident.

Thank you!

Environment

CentOS

Acceptance / Success Criteria

None

Activity

Sebastian Kordzik November 4, 2016 at 3:13 AM

Thanks!

Looking forward to testing this in the upcoming version.

Jesse White November 3, 2016 at 6:05 PM

Upgraded Newts in foundation-2016 with 21a1f5f70ccd4b64fcdd78a8384b759ea99a2486

Jesse White November 3, 2016 at 6:04 PM

I updated the Cassandra driver to 3.1.1 and reduced the maximum reconnection delay from 10 minutes to 2 minutes. With those changes, I was no longer able to reproduce the problem.
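
To illustrate why the cap on the reconnection delay matters, here is a minimal plain-Java sketch of a capped exponential backoff schedule. It is not the driver's code: the class and method names are hypothetical, and the 1-second base delay and the doubling rule are assumptions chosen for illustration. In the DataStax Java driver 3.x a schedule of this general shape is configured through a reconnection policy such as `ExponentialReconnectionPolicy(baseDelayMs, maxDelayMs)`; whether Newts uses exactly that class is not stated in this ticket.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a capped exponential reconnection schedule. */
class ReconnectBackoffSketch {

    /** Delay before retry number {@code attempt} (0-based): doubles each time, capped at maxDelayMs. */
    static long delayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        long delay = baseDelayMs;
        // Stop doubling once the cap is reached, so the value cannot overflow.
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /** Elapsed times (ms since connection loss) at which retries fire within windowMs. */
    static List<Long> retryTimesMs(long windowMs, long baseDelayMs, long maxDelayMs) {
        List<Long> times = new ArrayList<>();
        long elapsed = 0;
        for (int attempt = 0; ; attempt++) {
            elapsed += delayMs(attempt, baseDelayMs, maxDelayMs);
            if (elapsed > windowMs) {
                break;
            }
            times.add(elapsed);
        }
        return times;
    }

    public static void main(String[] args) {
        long base = 1_000L;               // assumed 1 s base delay
        long window = 30 * 60 * 1_000L;   // a 30-minute outage, as reported in this ticket
        long[] caps = {10 * 60 * 1_000L, 2 * 60 * 1_000L}; // 10 min vs. 2 min cap
        for (long cap : caps) {
            List<Long> times = retryTimesMs(window, base, cap);
            System.out.println("cap=" + cap / 1000 + "s retries=" + times.size()
                    + " longestGap=" + delayMs(Integer.MAX_VALUE, base, cap) / 1000 + "s");
        }
    }
}
```

Under these assumptions, once the schedule has reached its cap, the longest wait between two retries equals the cap itself, so a lower cap means a recovered cluster is noticed sooner. The sketch does not explain why no reconnection happened at all in the reported case.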

Jesse White August 11, 2016 at 9:23 AM

Thanks.

We currently use the default reconnection policy provided by the driver. I suspect that this may need some tuning.

Sebastian Kordzik August 11, 2016 at 8:26 AM

They were down for about 20-30 minutes, maybe less. And they were not really down: they are virtual machines, and both were backed up at the same time. So it was probably more a case of "answers delayed for a very long time" than a hard "machines went down". Not sure whether that made a difference, though.
It happened on a Sunday morning and we only learned of it the following Monday morning, so about 28 hours passed between the Cassandra servers coming back online and the restart of OpenNMS.

Fixed

Created August 11, 2016 at 5:32 AM
Updated November 4, 2016 at 10:53 AM
Resolved November 3, 2016 at 6:05 PM