Details
Bug
Status: Resolved
Major
Resolution: Fixed
21.1.0, Meridian-2017.1.7
Security Level: Default (Default Security Scheme)
Description
In my lab (tested on the latest develop branch and on Meridian 2017), I monitor every Cassandra instance in a cluster. When one instance goes down, OpenNMS generates nodeLostService events for the JMX-Cassandra service on every cluster member, not just on the one that actually went down.
Here is how that service is defined:
<service name="JMX-Cassandra" interval="300000" user-defined="false" status="on">
  <parameter key="port" value="7199"/>
  <parameter key="retry" value="2"/>
  <parameter key="timeout" value="3000"/>
  <parameter key="protocol" value="rmi"/>
  <parameter key="urlPath" value="/jmxrmi"/>
  <parameter key="rrd-base-name" value="jmx-cassandra"/>
  <parameter key="ds-name" value="jmx-cassandra"/>
  <parameter key="thresholding-enabled" value="true"/>
  <parameter key="factory" value="PASSWORD-CLEAR"/>
  <parameter key="username" value="cassandra"/>
  <parameter key="password" value="cassandra"/>
  <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
  <parameter key="beans.storage" value="org.apache.cassandra.db:type=StorageService"/>
  <parameter key="tests.operational" value="storage.OperationMode == 'NORMAL'"/>
  <parameter key="tests.joined" value="storage.Joined"/>
  <parameter key="tests.unreachables" value="empty(storage.UnreachableNodes)"/>
</service>
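The last parameter, tests.unreachables, is the problem. As far as I can tell, UnreachableNodes lists the cluster members that the polled instance cannot reach, so when one member goes down the list is non-empty on every surviving member and the test fails for all of them.
If I remove it from the configuration, the service behaves as expected: it goes down only for the instance that is not working.
That means the following line should not be part of the default configuration: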
<parameter key="tests.unreachables" value="empty(storage.UnreachableNodes)"/>
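For reference, here is a sketch of what the service definition would look like with that parameter removed. Every other value is copied unchanged from the definition above; the XML comment is only an annotation for this report and is not part of any shipped configuration.

```xml
<service name="JMX-Cassandra" interval="300000" user-defined="false" status="on">
  <parameter key="port" value="7199"/>
  <parameter key="retry" value="2"/>
  <parameter key="timeout" value="3000"/>
  <parameter key="protocol" value="rmi"/>
  <parameter key="urlPath" value="/jmxrmi"/>
  <parameter key="rrd-base-name" value="jmx-cassandra"/>
  <parameter key="ds-name" value="jmx-cassandra"/>
  <parameter key="thresholding-enabled" value="true"/>
  <parameter key="factory" value="PASSWORD-CLEAR"/>
  <parameter key="username" value="cassandra"/>
  <parameter key="password" value="cassandra"/>
  <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
  <parameter key="beans.storage" value="org.apache.cassandra.db:type=StorageService"/>
  <parameter key="tests.operational" value="storage.OperationMode == 'NORMAL'"/>
  <parameter key="tests.joined" value="storage.Joined"/>
  <!-- tests.unreachables omitted: it fails on every member when any single member is down -->
</service>
```

With this variant, the remaining tests (tests.operational and tests.joined) only evaluate attributes that describe the polled instance itself, which matches the behavior I observed after removing the line.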