  1. OpenNMS
  2. NMS-13232

Heartbeat topic lag with a large number of minions


Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: 27.0.5
    • Fix Version/s: 27.2.0
    • Component/s: Core, Minion
    • Security Level: Default (Default Security Scheme)
    • Horizon 2021 - Apr 14 - Apr 28
    • Backlog NG
    • 567

    Description

      A customer with a very large number of minions (~2500) is reporting that:

      • Minion-Heartbeat service is down on all or most minions
      • Lag on the OpenNMS.Sink.Heartbeat Kafka topic is high and increasing

      The customer's installed version is 27.0.5, but I imagine the issue would be the same with 27.1.0.

      Thread dump shows the following stack trace for the consumer thread:

      "kafka-consumer-128" Id=1497 RUNNABLE
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillQName(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttribute(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttributes2(Unknown Source)
              at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.startElement(Unknown Source)
              at org.eclipse.persistence.internal.oxm.record.XMLReader$ValidatingContentHandler.startElement(XMLReader.java:431)
              at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
              at org.opennms.core.xml.SimpleNamespaceFilter.startElement(SimpleNamespaceFilter.java:83)
              at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
              at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
              at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
              at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
              at org.eclipse.persistence.internal.oxm.record.XMLReader.parse(XMLReader.java:221)
              at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:492)
              at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:695)
              at org.eclipse.persistence.oxm.XMLUnmarshaller.unmarshal(XMLUnmarshaller.java:655)
              at org.eclipse.persistence.jaxb.JAXBUnmarshaller.unmarshal(JAXBUnmarshaller.java:301)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:276)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:196)
              at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:189)
              at org.opennms.netmgt.provision.persist.RequisitionFileUtils.getRequisitionFromFile(RequisitionFileUtils.java:72)
              at org.opennms.netmgt.provision.persist.FilesystemForeignSourceRepository.getRequisition(FilesystemForeignSourceRepository.java:268)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.provision(HeartbeatConsumer.java:205)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:121)
              at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:67)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.lambda$dispatch$0(AbstractMessageConsumerManager.java:100)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager$$Lambda$1507/0x00007f483fb21cb0.accept(Unknown Source)
              at java.base@11.0.6/java.lang.Iterable.forEach(Iterable.java:75)
              at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.dispatch(AbstractMessageConsumerManager.java:100)
              at org.opennms.core.ipc.sink.kafka.server.KafkaMessageConsumerManager$KafkaConsumerRunner.run(KafkaMessageConsumerManager.java:214)
              at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
      

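      The suspicion is that we can't process heartbeats fast enough at this scale. In the trace above, the consumer thread is busy reading and unmarshalling a requisition file (with XML validation) from within HeartbeatConsumer.provision while handling a heartbeat message. We need to investigate optimizing the consumer thread, adding more consumer threads, or both.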
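
      As an illustration of the "optimize the consumer thread" option, here is a minimal sketch of caching an expensive per-message lookup so that the file read and XML parse happen at most once per time window instead of once per heartbeat. This is not the fix that was applied to OpenNMS. TtlCache, the loader lambda, the "example-foreign-source" key, and the 60-second TTL are all hypothetical names and values chosen for the example; the loader merely stands in for the requisition read seen in the stack trace.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class RequisitionCacheSketch {

    /** Caches one value per key and reloads it only after the TTL has expired. */
    static final class TtlCache<K, V> {
        private static final class Entry<V> {
            final V value;
            final long loadedAtNanos;

            Entry(V value, long loadedAtNanos) {
                this.value = value;
                this.loadedAtNanos = loadedAtNanos;
            }
        }

        private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
        private final Function<K, V> loader;
        private final long ttlNanos;

        TtlCache(Function<K, V> loader, long ttlMillis) {
            this.loader = loader;
            this.ttlNanos = ttlMillis * 1_000_000L;
        }

        V get(K key) {
            long now = System.nanoTime();
            // compute() is atomic per key, so concurrent callers do not
            // trigger duplicate loads for the same key.
            Entry<V> entry = entries.compute(key, (k, old) ->
                    (old == null || now - old.loadedAtNanos > ttlNanos)
                            ? new Entry<>(loader.apply(k), now)
                            : old);
            return entry.value;
        }
    }

    public static void main(String[] args) {
        AtomicInteger parses = new AtomicInteger();

        // Hypothetical stand-in for the expensive file read + XML unmarshal.
        TtlCache<String, String> cache = new TtlCache<>(foreignSource -> {
            parses.incrementAndGet();
            return "requisition:" + foreignSource;
        }, 60_000);

        // Simulate one heartbeat from each of ~2500 minions.
        for (int i = 0; i < 2500; i++) {
            cache.get("example-foreign-source");
        }

        System.out.println("parses=" + parses.get()); // prints "parses=1"
    }
}
```

      The trade-off is staleness: a cached requisition can lag behind changes on disk by up to the TTL, so a real change would need to invalidate the entry whenever the requisition is written. Adding consumer threads is a separate option and is not shown here.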

      Attachments

        1. core.zip (99 kB)
        2. minion.zip (237 kB)

            People

              cgorantla Chandra Gorantla
              dino2gnt Dino Yancey
