Details
Type: Enhancement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Meridian-2017.1.5, Meridian-2018.1.1, 22.0.4
Fix Version/s: 23.0.1, Meridian-2017.1.13, Meridian-2018.1.3
Component/s: Correlator
Security Level: Default (Default Security Scheme)
Sprint: Horizon - October 31st 2018, Horizon - November 14th 2018
Description
Some exceptions can cause a Drools engine to stop working entirely, while the Correlator module remains "running". In this state, OpenNMS will not stop cleanly, and must be killed.
One such exception is java.util.ConcurrentModificationException:
Exception in thread "FireTask" java.util.ConcurrentModificationException
    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
    at java.util.ArrayList$Itr.next(ArrayList.java:851)
    at org.drools.core.phreak.PhreakFromNode.doLeftInserts(PhreakFromNode.java:110)
    at org.drools.core.phreak.PhreakFromNode.doNode(PhreakFromNode.java:68)
    at org.drools.core.phreak.RuleNetworkEvaluator.evalNode(RuleNetworkEvaluator.java:355)
    at org.drools.core.phreak.RuleNetworkEvaluator.innerEval(RuleNetworkEvaluator.java:301)
    at org.drools.core.phreak.RuleNetworkEvaluator.outerEval(RuleNetworkEvaluator.java:136)
    at org.drools.core.phreak.RuleNetworkEvaluator.evaluateNetwork(RuleNetworkEvaluator.java:94)
    at org.drools.core.phreak.RuleExecutor.evaluateNetwork(RuleExecutor.java:65)
    at org.drools.core.common.DefaultAgenda.evaluateEagerList(DefaultAgenda.java:983)
    at org.drools.core.common.DefaultAgenda.fireLoop(DefaultAgenda.java:1306)
    at org.drools.core.common.DefaultAgenda.fireUntilHalt(DefaultAgenda.java:1232)
    at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireUntilHalt(StatefulKnowledgeSessionImpl.java:1398)
    at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireUntilHalt(StatefulKnowledgeSessionImpl.java:1377)
    at org.opennms.netmgt.correlation.drools.DroolsCorrelationEngine.lambda$initialize$2(DroolsCorrelationEngine.java:217)
    at java.lang.Thread.run(Thread.java:745)
Please improve the Correlator so that it detects when an engine has failed and attempts to restart it.
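To make the request concrete, here is a minimal sketch of one way to supervise the fire loop. It is not the actual Correlator code: EngineSupervisor, its State enum, and the fireLoop, maxRestarts, and onGiveUp parameters are all hypothetical names, and the Runnable merely stands in for the call to fireUntilHalt() seen in the stack trace. A real fix would probably also need to rebuild the Drools session before each restart.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical supervisor for one engine's fire loop; not part of OpenNMS. */
final class EngineSupervisor {
    enum State { RUNNING, FAILED, STOPPED }

    private final String name;
    private final Runnable fireLoop;            // stand-in for session.fireUntilHalt()
    private final int maxRestarts;              // restarts allowed before giving up
    private final Consumer<Throwable> onGiveUp; // notification hook (assumption)
    private final AtomicInteger failures = new AtomicInteger();
    private volatile State state = State.STOPPED;
    private volatile Thread worker;

    EngineSupervisor(String name, Runnable fireLoop, int maxRestarts,
                     Consumer<Throwable> onGiveUp) {
        this.name = name;
        this.fireLoop = fireLoop;
        this.maxRestarts = maxRestarts;
        this.onGiveUp = onGiveUp;
    }

    synchronized void start() {
        state = State.RUNNING;
        worker = new Thread(this::supervise, "FireTask-" + name);
        worker.setDaemon(true); // a dead or stuck loop should not block JVM exit
        worker.start();
    }

    private void supervise() {
        while (state == State.RUNNING) {
            try {
                fireLoop.run();
                state = State.STOPPED; // normal return: the loop was halted on purpose
            } catch (Throwable t) {
                // The loop died. Restart it unless the restart budget is used up.
                if (failures.incrementAndGet() > maxRestarts) {
                    state = State.FAILED;
                    onGiveUp.accept(t);
                }
            }
        }
    }

    /** Waits up to the given time for the supervising thread to finish. */
    boolean awaitTermination(long millis) {
        Thread w = worker;
        if (w == null) {
            return true;
        }
        try {
            w.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !w.isAlive();
    }

    State state() { return state; }

    int failures() { return failures.get(); }
}
```

With this shape, an uncaught ConcurrentModificationException no longer ends the "FireTask" thread silently: the loop is re-entered until the budget is exhausted, after which the engine is marked FAILED and the callback can raise a notification.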
If the engine cannot be started, there should be a notification mechanism, and it should be possible to stop OpenNMS without resorting to "kill $(cat ${OPENNMS_HOME}/logs/opennms.pid)".
The current state also prevents cluster management software from identifying that part of the application has failed - "service opennms status" still says it's Running.
There should be some way to signal a clustering tool that part of the application has failed, and it should be restarted.
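As an illustration of what such a signal could look like, the sketch below aggregates per-engine health into a summary line and an exit code that a status command or cluster probe could consume. CorrelatorHealth, its methods, and the exit code values are hypothetical; nothing here reflects how "service opennms status" is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical health aggregate for Correlator engines; not part of OpenNMS. */
final class CorrelatorHealth {
    static final int EXIT_OK = 0;
    static final int EXIT_DEGRADED = 1; // arbitrary non-zero value (assumption)

    // Engine name -> whether its fire loop is currently alive.
    private final Map<String, Boolean> engineHealthy = new LinkedHashMap<>();

    synchronized void report(String engine, boolean healthy) {
        engineHealthy.put(engine, healthy);
    }

    /** Non-zero as soon as any engine has been reported as failed. */
    synchronized int exitCode() {
        return engineHealthy.containsValue(false) ? EXIT_DEGRADED : EXIT_OK;
    }

    /** One-line, human-readable status suitable for a status command. */
    synchronized String summary() {
        String failed = engineHealthy.entrySet().stream()
                .filter(e -> !e.getValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(", "));
        return failed.isEmpty()
                ? "Correlator: all engines running"
                : "Correlator: failed engines: " + failed;
    }
}
```

A clustering tool only needs the exit code; the summary is for the operator reading the status output.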