Details
Description
A customer with a very large number of minions (~2500) is reporting that:
- Minion-Heartbeat service is down on all or most minions
- High and increasing lag on the OpenNMS.Sink.Heartbeat Kafka topic
The customer's installed version is 27.0.5, but I imagine the issue would be the same with 27.1.0.
Thread dump shows the following stack trace for the consumer thread:
"kafka-consumer-128" Id=1497 RUNNABLE
    at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillQName(Unknown Source)
    at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttribute(Unknown Source)
    at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.fillXMLAttributes2(Unknown Source)
    at org.apache.xerces.jaxp.validation.ValidatorHandlerImpl.startElement(Unknown Source)
    at org.eclipse.persistence.internal.oxm.record.XMLReader$ValidatingContentHandler.startElement(XMLReader.java:431)
    at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
    at org.opennms.core.xml.SimpleNamespaceFilter.startElement(SimpleNamespaceFilter.java:83)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at java.xml@11.0.6/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
    at org.eclipse.persistence.internal.oxm.record.XMLReader.parse(XMLReader.java:221)
    at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:492)
    at org.eclipse.persistence.internal.oxm.record.SAXUnmarshaller.unmarshal(SAXUnmarshaller.java:695)
    at org.eclipse.persistence.oxm.XMLUnmarshaller.unmarshal(XMLUnmarshaller.java:655)
    at org.eclipse.persistence.jaxb.JAXBUnmarshaller.unmarshal(JAXBUnmarshaller.java:301)
    at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:276)
    at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:196)
    at org.opennms.core.xml.JaxbUtils.unmarshal(JaxbUtils.java:189)
    at org.opennms.netmgt.provision.persist.RequisitionFileUtils.getRequisitionFromFile(RequisitionFileUtils.java:72)
    at org.opennms.netmgt.provision.persist.FilesystemForeignSourceRepository.getRequisition(FilesystemForeignSourceRepository.java:268)
    at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.provision(HeartbeatConsumer.java:205)
    at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:121)
    at org.opennms.minion.heartbeat.consumer.HeartbeatConsumer.handleMessage(HeartbeatConsumer.java:67)
    at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.lambda$dispatch$0(AbstractMessageConsumerManager.java:100)
    at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager$$Lambda$1507/0x00007f483fb21cb0.accept(Unknown Source)
    at java.base@11.0.6/java.lang.Iterable.forEach(Iterable.java:75)
    at org.opennms.core.ipc.sink.common.AbstractMessageConsumerManager.dispatch(AbstractMessageConsumerManager.java:100)
    at org.opennms.core.ipc.sink.kafka.server.KafkaMessageConsumerManager$KafkaConsumerRunner.run(KafkaMessageConsumerManager.java:214)
    at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
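The trace shows the consumer thread parsing and validating a requisition XML file (RequisitionFileUtils.getRequisitionFromFile, called from HeartbeatConsumer.provision) while handling a heartbeat message. The suspicion is that we can't process the volume of heartbeats fast enough at this scale. We need to investigate optimizing the consumer thread, adding more consumer threads, or both.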
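As a rough illustration of the "optimize the consumer thread" option, here is a minimal sketch of a cache that re-runs an expensive load only when a version marker changes (for example, a file's last-modified timestamp). The class name ReloadOnChangeCache and its constructor parameters are hypothetical and are not part of OpenNMS. Whether the requisition can safely be cached this way in HeartbeatConsumer is an assumption that would need to be checked against the actual code.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an OpenNMS class): caches the result of an
 * expensive load and only reloads when the supplied version changes.
 */
final class ReloadOnChangeCache<T> {
    private final LongSupplier versionSupplier; // e.g. a file's last-modified time (assumption)
    private final Supplier<T> loader;           // e.g. the expensive XML unmarshal (assumption)
    private long cachedVersion;
    private T cachedValue;
    private boolean loaded;

    ReloadOnChangeCache(LongSupplier versionSupplier, Supplier<T> loader) {
        this.versionSupplier = versionSupplier;
        this.loader = loader;
    }

    /** Returns the cached value, reloading first if the version has changed. */
    synchronized T get() {
        long currentVersion = versionSupplier.getAsLong();
        if (!loaded || currentVersion != cachedVersion) {
            cachedValue = loader.get();
            cachedVersion = currentVersion;
            loaded = true;
        }
        return cachedValue;
    }
}
```

With something like this, repeated heartbeats would pay the parsing cost only when the underlying file changes, rather than on each message. The lookup is synchronized, so it would also remain safe if more consumer threads are added later.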