Incoming syslog/trap flood can overwhelm new handler code
Description
Acceptance / Success Criteria
Lucidchart Diagrams
Activity

Jesse White March 29, 2017 at 9:57 AM
When processing syslog/trap messages, we use sendNowSync, which is now fully synchronous (see https://issues.opennms.org/browse/HZN-1034).
This mitigates the scenarios described above and pushes back-pressure onto the broker or socket if the messages cannot be processed fast enough.
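For context, here is a minimal sketch of a handler built around that call. The class name, constructor wiring, and package paths are illustrative assumptions and may differ by release; EventForwarder.sendNowSync is the API referenced above.
{code:java}
import org.opennms.netmgt.events.api.EventForwarder;
import org.opennms.netmgt.xml.event.Event;

// Hypothetical handler sketch -- not the actual syslogd/trapd code.
public class SyslogEventHandler {

    private final EventForwarder eventForwarder;

    public SyslogEventHandler(EventForwarder eventForwarder) {
        this.eventForwarder = eventForwarder;
    }

    public void handle(Event event) {
        // sendNowSync blocks until Eventd has processed the event, so a slow consumer
        // throttles the caller and, in turn, the broker or listening socket.
        eventForwarder.sendNowSync(event);
    }
}
{code}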

Seth Leger February 1, 2017 at 10:49 AM
Reducing priority since we can mitigate this by bounding the Eventd queue.

Seth Leger December 14, 2016 at 10:44 AM
I added sizes and blockWhenFull behavior to all of the SEDA queues and executors in the processing chain, which solves the first of the two completion criteria listed in the December 5 comment below.
https://github.com/OpenNMS/opennms/pull/1198
commit 118afabead5ac9f22a8627c81391871fb9ca036c
There are still problems on OpenNMS if you use unbounded Eventd handler queues and I'm working on a fix for that inside .
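As an aside, one common way to get blocking behavior from a plain Java executor (not necessarily how the PR above does it) is to pair a bounded work queue with a rejection handler that makes the submitting thread wait for space; the pool and queue sizes below are arbitrary.
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorSketch {

    // Illustrative only: a fixed-size pool over a bounded queue whose rejection handler
    // blocks the producer instead of throwing or queueing without limit.
    public static ThreadPoolExecutor newBoundedExecutor() {
        return new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10_000),
                (task, pool) -> {
                    try {
                        pool.getQueue().put(task); // block until space frees up
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
    }
}
{code}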

Seth Leger December 5, 2016 at 3:37 PM
There are two criteria for completion:
1. If a Minion becomes disconnected from OpenNMS, incoming syslog and trap messages should not cause OutOfMemoryErrors. The queues should fill to a bounded size, and then new syslog and trap messages will be dropped at the network level.
2. If the rate of incoming messages into OpenNMS exceeds OpenNMS's processing speed, then OpenNMS should not fail with OutOfMemoryErrors and should provide back-pressure on the message broker by processing events synchronously.
If a large backlog of syslog or trap messages generated by a Minion is waiting on a Kafka broker and OpenNMS is started up, it will attempt to stream all of the messages at once into the Camel messaging system. This will exhaust all of the Java heap space and lead to an OutOfMemoryError if the number of messages in the backlog is too large.
To provide back-pressure on this queue and prevent memory from being exhausted, we should give the incoming Camel SEDA queue a bounded size and mark it as "blockWhenFull".
The queue will normally be close to empty because events are processed quickly after they are received, so the limit should only come into play when a significant backlog has accumulated in the messaging channel.
Since most syslog and trap messages are 1KB - 2KB in size, I would recommend a default queue size of 50,000, which should consume roughly 100MB of RAM under full load.
This needs to be done in 4 contexts:
blueprint-syslog-handler-default.xml
blueprint-syslog-handler-kafka-default.xml
blueprint-trapd-handler-default.xml
blueprint-trapd-handler-kafka-default.xml
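A minimal Camel Java DSL sketch of the size/blockWhenFull combination described above; the endpoint and bean names are made up for illustration, and the actual change belongs in the four blueprint XML files listed.
{code:java}
import org.apache.camel.builder.RouteBuilder;

public class BoundedSedaRouteSketch extends RouteBuilder {

    @Override
    public void configure() {
        // Producer side: block the sender when the bounded queue fills up, instead of
        // letting it grow without limit, so back-pressure reaches the messaging channel.
        from("direct:incomingSyslog")
            .to("seda:handleMessage?size=50000&blockWhenFull=true");

        // Consumer side: the same queue name and size, drained by the existing handler.
        from("seda:handleMessage?size=50000")
            .to("bean:syslogHandler");
    }
}
{code}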