Minion stops sending flow data into Kafka
Description
Activity
Chandra Gorantla April 15, 2019 at 6:53 PM
Closing this, as it requires a new feature; https://issues.opennms.org/browse/HZN-1531 should resolve it.
Chandra Gorantla April 15, 2019 at 6:30 PM
Created https://issues.opennms.org/browse/HZN-1531 for handling large buffers.
For RecordTooLargeException, or any other exception that is not a TimeoutException, we should drop the message, as the failure is non-recoverable.
This is handled in PR: https://github.com/OpenNMS/opennms/pull/2451
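The retry-versus-drop rule described above can be sketched roughly as follows. This is not the code from the PR; `SendPolicy`, `Attempt`, `SendTimeoutException` (a stand-in for Kafka's TimeoutException) and the `maxAttempts` bound are all hypothetical names introduced so the example compiles without the Kafka client on the classpath. The failure is unwrapped from an ExecutionException, matching how it surfaces when waiting on the send future.

```java
import java.util.concurrent.ExecutionException;

/** Illustrative sketch: retry on timeout, drop on anything else. */
final class SendPolicy {
    enum Outcome { SENT, DROPPED }

    /** Hypothetical stand-in for org.apache.kafka.common.errors.TimeoutException. */
    static final class SendTimeoutException extends RuntimeException { }

    /** Hypothetical wrapper around "send and wait for the future". */
    @FunctionalInterface
    interface Attempt {
        void run() throws ExecutionException, InterruptedException;
    }

    /** maxAttempts is an assumption that keeps the sketch from looping forever. */
    static Outcome send(Attempt attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();
                return Outcome.SENT;
            } catch (ExecutionException e) {
                if (e.getCause() instanceof SendTimeoutException) {
                    // Timeouts are treated as recoverable: try again.
                    System.err.println("WARN: timeout on attempt " + i + ", retrying");
                    continue;
                }
                // Any other cause (e.g. a too-large record) will not fix itself: drop.
                System.err.println("ERROR: non-recoverable failure, dropping message: " + e.getCause());
                return Outcome.DROPPED;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Outcome.DROPPED;
            }
        }
        return Outcome.DROPPED; // gave up after maxAttempts timeouts
    }
}
```

With this shape, a message that fails for any reason other than a timeout is attempted once and then released, so a dispatcher thread is not held retrying it.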
Sean Torres April 3, 2019 at 12:40 AM
Looks like there was a similar issue internally for Kafka around this.
Sean Torres April 3, 2019 at 12:34 AM
How about catching "RecordTooLargeException" in its own catch block?
Count the number of individual messages being bundled, for logging purposes (log at WARN/DEBUG).
If the batch holds more than one message, split it in half, submit the two new batches, and break from the loop.
If it is a single message, it will never send, so log a FAIL and break instead of looping endlessly and holding resources.
Recursion should handle this well enough while keeping the batch size large, since in this instance it's not happening all the time. The count metric in the logs would help with tuning the batch.size per parser.
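A minimal sketch of the halving idea proposed above, under stated assumptions: `BatchSplitter`, `sendOrSplit` and `TooLargeException` (standing in for Kafka's RecordTooLargeException) are hypothetical names, and the `sender` callback is assumed to throw when the bundled batch exceeds the size limit. It is not how the dispatcher is actually implemented.

```java
import java.util.List;
import java.util.function.Consumer;

/** Illustrative sketch: split an oversized batch in half until each part fits. */
final class BatchSplitter {

    /** Hypothetical stand-in for org.apache.kafka.common.errors.RecordTooLargeException. */
    static final class TooLargeException extends RuntimeException {
        TooLargeException(String message) { super(message); }
    }

    /**
     * Tries to send the whole batch; on a too-large error, halves it and
     * recurses into each half. Returns how many single messages were dropped
     * because they could not be sent even on their own.
     */
    static <T> int sendOrSplit(List<T> batch, Consumer<List<T>> sender) {
        if (batch.isEmpty()) {
            return 0;
        }
        try {
            sender.accept(batch);
            return 0;
        } catch (TooLargeException e) {
            if (batch.size() == 1) {
                // A lone message that is too large will never be accepted.
                System.err.println("FAIL: dropping oversized message: " + e.getMessage());
                return 1;
            }
            // The message count is logged to help tune the batch size later.
            System.err.println("WARN: batch of " + batch.size() + " messages too large, splitting");
            int mid = batch.size() / 2;
            return sendOrSplit(batch.subList(0, mid), sender)
                 + sendOrSplit(batch.subList(mid, batch.size()), sender);
        }
    }
}
```

Because the split only happens after a failure, batches that fit are still sent whole, and the recursion depth is bounded by the logarithm of the batch size.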
Sean Torres April 2, 2019 at 11:39 PM
The issue occurred again. I connected to the debug port and evaluated the exception:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1061405 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration
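For context, the size check behind this exception can be approximated as below. The sketch assumes the producer was running with Kafka's default max.request.size of 1048576 bytes, which the ticket does not confirm; `ProducerSizeCheck` and its methods are hypothetical helpers, and the real client measures its own estimate of the serialized record size.

```java
import java.util.Properties;

/** Illustrative sketch: compare a serialized size against max.request.size. */
final class ProducerSizeCheck {
    /** Kafka producer default for max.request.size (1 MiB). */
    static final int DEFAULT_MAX_REQUEST_SIZE = 1_048_576;

    /** Builds producer properties with an explicit request size limit. */
    static Properties withMaxRequestSize(int bytes) {
        Properties props = new Properties();
        props.setProperty("max.request.size", Integer.toString(bytes));
        return props;
    }

    /** True if a record of the given serialized size is within the configured limit. */
    static boolean fits(int serializedSize, Properties props) {
        int limit = Integer.parseInt(
                props.getProperty("max.request.size", Integer.toString(DEFAULT_MAX_REQUEST_SIZE)));
        return serializedSize <= limit;
    }
}
```

Under that assumption, the 1061405-byte message above exceeds the limit by 12829 bytes. Raising the producer limit alone may not be enough, since the broker and topic enforce their own maximum message sizes.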
After some unknown interval, Minion fails to send data into Kafka.
Below are excerpts from the logs, which fill up almost instantly (see the zgrep output below).
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-15 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-1 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-7 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-15 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-1 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,282 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-7 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,272 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-9 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2019-04-01T17:42:55,272 | WARN | OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-12 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.

# zgrep Kafka * | sed 's/^[^\s]*\s//' | sort | uniq -cd | sort -n
2759 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-10 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2813 AggregatorFlush-Telemetry-Netflow-5 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
2853 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-6 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
27432 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-15 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
27490 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-12 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
27878 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-16 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
28051 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-9 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
28497 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-11 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
28639 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-3 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
28864 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-14 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
28987 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-2 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
29237 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-8 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
29391 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-1 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
29597 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-13 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
29875 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-7 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
29877 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-5 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.
30124 OpenNMS.Sink.AsyncDispatcher.Telemetry-Netflow-5-Thread-4 | KafkaRemoteMessageDispatcherFactory | 249 - org.opennms.core.ipc.sink.kafka.client - 25.0.0.SNAPSHOT | Timeout occured while sending message to topic OpenNMS.Sink.Telemetry-Netflow-5, it will be attempted again.