OpenNMS / NMS-13232

Heartbeat topic lag with a large number of minions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: 27.0.5
    • Fix Version/s: 27.2.0
    • Component/s: Core, Minion
    • Security Level: Default (Default Security Scheme)
    • Labels:
    • Sprint:
      Horizon 2021 - Apr 14 - Apr 28
    • HB Backlog Status:
      Backlog NG
    • FD#:
      567

      Description

      A customer with a very large number of minions (~2500) is reporting that:

      • Minion-Heartbeat service is down on all or most minions
      • High and increasing lag on the OpenNMS.Sink.Heartbeat Kafka topic

      The customer's installed version is 27.0.5, but I imagine the issue would be the same with 27.1.0.

      A thread dump shows the following stack trace for the consumer thread:

      "kafka-consumer-128" Id=1497 RUNNABLE
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillQName(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttribute(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttributes2(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.startElement(Unknown Source)
              at org.eclipse.persistence.internal.oxm.record.XMLReader$ValidatingContentHandler.startElement(XMLReader.java:431)
              at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
              at org.opennms.core.xml.SimpleNamespaceFilter.startElement(SimpleNamespaceFilter.java:83)
              at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
              at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
              at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
              at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
              at org.eclipse.persistence.internal.oxm.record.XMLReader.parse(XMLReader.java:221)
              at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:492)
              at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:695)
              at org.eclipse.persistence.oxm.XMLUnmarshaller.unmarshal(XMLUnmarshaller.java:655)
              at org.eclipse.persistence.jaxb.JAXBUnmarshaller.unmarshal(JAXBUnmarshaller.java:301)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:276)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:196)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:189)
              at org.opennms.netmgt.provision.persist.RequisitionFileUtils.getRequisitionFromFile(RequisitionFileUtils.java:72)
              at org.opennms.netmgt.provision.persist.FilesystemForeignSourceRepository.getRequisition(FilesystemForeignSourceRepository.java:268)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.provision(HeartbeatConsumer.java:205)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:121)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:67)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.lambda$dispatch$0(AbstractMessageConsumerManager.java:100)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager$$Lambda$1507/0x00007f483fb21cb0.accept(Unknown Source)
              at java.base@11.0.6/java.lang.Iterable.forEach(Iterable.java:75)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.dispatch(AbstractMessageConsumerManager.java:100)
              at org.opennms.core.ipc.sink.kafka.server.KafkaMessageConsumerManager$KafkaConsumerRunner.run(KafkaMessageConsumerManager.java:214)
              at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
      

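      The suspicion is that we cannot process heartbeats fast enough at this scale. The trace shows the consumer thread reading and unmarshalling a requisition file while handling a heartbeat (HeartbeatConsumer.provision calls FilesystemForeignSourceRepository.getRequisition), which looks expensive at this volume. We need to investigate optimizing the consumer thread, adding consumer threads, or both.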
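
      One possible shape for the "optimize the consumer thread" option, as a minimal sketch only: remember which minions have already gone through the provisioning step, so that repeated heartbeats from the same minion skip the requisition lookup seen in the stack trace. This assumes the lookup only needs to run the first time a minion is seen, which may not hold for the real HeartbeatConsumer (for example if it must detect changes on every heartbeat). The class HeartbeatDedupSketch and its methods are hypothetical and are not OpenNMS APIs.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch, not OpenNMS code: run the expensive step once per minion. */
final class HeartbeatDedupSketch {

    // IDs of minions whose provisioning step has already run (thread-safe set).
    private final Set<String> provisioned = ConcurrentHashMap.newKeySet();

    // Stand-in for the expensive work (reading and unmarshalling the requisition).
    private final Consumer<String> expensiveProvision;

    HeartbeatDedupSketch(Consumer<String> expensiveProvision) {
        this.expensiveProvision = expensiveProvision;
    }

    /** Returns true if the expensive step ran for this heartbeat. */
    boolean handleHeartbeat(String minionId) {
        // add() returns false when the ID is already present,
        // so only the first heartbeat per minion pays the cost.
        if (!provisioned.add(minionId)) {
            return false;
        }
        try {
            expensiveProvision.accept(minionId);
            return true;
        } catch (RuntimeException e) {
            // Forget the minion on failure so the next heartbeat retries.
            provisioned.remove(minionId);
            throw e;
        }
    }

    /** Forces the expensive step to run again on the minion's next heartbeat. */
    void invalidate(String minionId) {
        provisioned.remove(minionId);
    }

    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        HeartbeatDedupSketch consumer =
                new HeartbeatDedupSketch(id -> lookups.incrementAndGet());

        // Simulate 3 heartbeat rounds from 2500 minions.
        for (int round = 0; round < 3; round++) {
            for (int i = 0; i < 2500; i++) {
                consumer.handleHeartbeat("minion-" + i);
            }
        }
        System.out.println("expensive lookups: " + lookups.get()); // prints 2500
    }
}
```

      In this sketch, 7500 simulated heartbeats trigger only 2500 expensive lookups. Adding consumer threads is a separate option and is not covered here.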

        Attachments

        1. core.zip
          99 kB
        2. minion.zip
          237 kB

              People

              Assignee:
              cgorantla Chandra Gorantla
              Reporter:
              dino2gnt Dino Yancey
              Votes:
              1
              Watchers:
              4
