OpenNMS / NMS-10363

Detect and Attempt to Restart Failed Drools Engines

    Details

    • Sprint:
      Horizon - October 31st 2018, Horizon - November 14th 2018

      Description

      Some exceptions cause a Drools engine to stop working entirely while the Correlator module still reports itself as "running". In this state, OpenNMS does not stop cleanly and must be killed.

      One such exception is java.util.ConcurrentModificationException:

       Exception in thread "FireTask" java.util.ConcurrentModificationException
          at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
          at java.util.ArrayList$Itr.next(ArrayList.java:851)
          at org.drools.core.phreak.PhreakFromNode.doLeftInserts(PhreakFromNode.java:110)
          at org.drools.core.phreak.PhreakFromNode.doNode(PhreakFromNode.java:68)
          at org.drools.core.phreak.RuleNetworkEvaluator.evalNode(RuleNetworkEvaluator.java:355)
          at org.drools.core.phreak.RuleNetworkEvaluator.innerEval(RuleNetworkEvaluator.java:301)
          at org.drools.core.phreak.RuleNetworkEvaluator.outerEval(RuleNetworkEvaluator.java:136)
          at org.drools.core.phreak.RuleNetworkEvaluator.evaluateNetwork(RuleNetworkEvaluator.java:94)
          at org.drools.core.phreak.RuleExecutor.evaluateNetwork(RuleExecutor.java:65)
          at org.drools.core.common.DefaultAgenda.evaluateEagerList(DefaultAgenda.java:983)
          at org.drools.core.common.DefaultAgenda.fireLoop(DefaultAgenda.java:1306)
          at org.drools.core.common.DefaultAgenda.fireUntilHalt(DefaultAgenda.java:1232)
          at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireUntilHalt(StatefulKnowledgeSessionImpl.java:1398)
          at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireUntilHalt(StatefulKnowledgeSessionImpl.java:1377)
          at org.opennms.netmgt.correlation.drools.DroolsCorrelationEngine.lambda$initialize$2(DroolsCorrelationEngine.java:217)
          at java.lang.Thread.run(Thread.java:745)

      Please improve the Correlator so that it can detect when an engine has failed and attempt to restart it.
      If the engine cannot be restarted, there should be a notification mechanism, and it should be possible to stop OpenNMS without resorting to "kill $(cat ${OPENNMS_HOME}/logs/opennms.pid)".
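
      One possible shape for the requested behaviour is sketched below. It is an illustration only, not the Correlator's actual code: RestartableEngine, EngineSupervisor, FlakyEngine, maxRestarts and the notifier callback are hypothetical names, and runUntilHalt() merely stands in for the blocking fireUntilHalt() call seen in the stack trace. The idea is to catch the exception that would otherwise kill the "FireTask" thread, rebuild the engine a bounded number of times, and let the thread exit after a final notification so that shutdown is not blocked.

```java
import java.util.ConcurrentModificationException;
import java.util.function.Consumer;

/** Hypothetical stand-in for one correlation engine; not an OpenNMS or Drools type. */
interface RestartableEngine {
    String name();

    /** Blocks while rules are firing; returns normally only on a clean halt. */
    void runUntilHalt();

    /** Discards the broken session and builds a fresh one. */
    void reinitialize();
}

/** Runs an engine, restarts it after a failure, and reports when it gives up. */
final class EngineSupervisor implements Runnable {
    private final RestartableEngine engine;
    private final int maxRestarts;
    private final Consumer<String> notifier;
    private volatile boolean failed = false;
    private volatile int restarts = 0;

    EngineSupervisor(RestartableEngine engine, int maxRestarts, Consumer<String> notifier) {
        this.engine = engine;
        this.maxRestarts = maxRestarts;
        this.notifier = notifier;
    }

    @Override
    public void run() {
        while (true) {
            try {
                engine.runUntilHalt();
                return; // clean halt, nothing to recover
            } catch (RuntimeException e) {
                // Covers ConcurrentModificationException and similar rule failures.
                if (restarts >= maxRestarts) {
                    failed = true;
                    notifier.accept(engine.name() + " failed permanently: " + e);
                    return; // let the thread end so shutdown is not blocked
                }
                restarts++;
                notifier.accept(engine.name() + " failed, restart " + restarts
                        + " of " + maxRestarts + ": " + e);
                try {
                    engine.reinitialize();
                } catch (RuntimeException re) {
                    failed = true;
                    notifier.accept(engine.name() + " could not be reinitialized: " + re);
                    return;
                }
            }
        }
    }

    boolean hasFailed() { return failed; }

    int restartCount() { return restarts; }
}

/** Demo engine that throws on its first few runs, then halts cleanly. */
final class FlakyEngine implements RestartableEngine {
    private int failuresLeft;

    FlakyEngine(int failures) { this.failuresLeft = failures; }

    @Override public String name() { return "flaky-engine"; }

    @Override public void runUntilHalt() {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new ConcurrentModificationException("simulated rule failure");
        }
    }

    @Override public void reinitialize() { /* nothing to rebuild in the demo */ }
}
```

      In this sketch the supervisor would be the Runnable handed to the engine's thread, for example new Thread(new EngineSupervisor(engine, 3, System.err::println), "FireTask").start(). The notifier is deliberately just a callback; whether a real fix would send an OpenNMS event, write a log entry, or do something else is left open here.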

      The current state also prevents cluster management software from detecting that part of the application has failed: "service opennms status" still reports Running.
      There should be some way to signal to a clustering tool that part of the application has failed and should be restarted.
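
      A minimal sketch of how per-engine failures could be surfaced to an external status check follows. Everything in it is an assumption made for illustration: EngineState, CorrelatorHealth, and the choice of 1 as the non-zero exit code are hypothetical and do not describe how "service opennms status" works today. The point is only that the overall status should stop reporting healthy as soon as any engine is marked failed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-engine states; not an OpenNMS type. */
enum EngineState { RUNNING, RESTARTING, FAILED }

/** Collects engine states so that a status probe can see partial failures. */
final class CorrelatorHealth {
    private final Map<String, EngineState> states = new ConcurrentHashMap<>();

    /** Called by whatever supervises an engine whenever its state changes. */
    void report(String engineName, EngineState state) {
        states.put(engineName, state);
    }

    /** Healthy only while no engine has been marked FAILED. */
    boolean isHealthy() {
        return !states.containsValue(EngineState.FAILED);
    }

    /** Exit code a status probe could return: 0 when healthy, 1 (arbitrary) otherwise. */
    int statusExitCode() {
        return isHealthy() ? 0 : 1;
    }
}
```

      With something like this in place, a clustering tool that polls the status command would see a non-zero result once an engine has failed for good, and could restart the application on its own terms.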

            People

            • Assignee:
              cgorantla Chandra Gorantla
              Reporter:
              wkeaney Will Keaney
            • Votes:
              0
              Watchers:
              5