Incoming syslog/trap flood can overwhelm new handler code
Description
Acceptance / Success Criteria
Lucidchart Diagrams
Activity

Jesse White March 29, 2017 at 9:57 AM
When processing syslog/trap messages, we use sendNowSync, which is now fully synchronous (see https://issues.opennms.org/browse/HZN-1034).
This mitigates the scenarios described above and pushes back-pressure onto the broker or socket if the messages cannot be processed fast enough.
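For context, here is a minimal sketch of a handler built around that call. The class name, constructor wiring, and package paths are illustrative assumptions and may differ by release; EventForwarder.sendNowSync is the API referenced above.
{code:java}
import org.opennms.netmgt.events.api.EventForwarder;
import org.opennms.netmgt.xml.event.Event;

// Hypothetical handler sketch -- not the actual syslogd/trapd code.
public class SyslogEventHandler {

    private final EventForwarder eventForwarder;

    public SyslogEventHandler(EventForwarder eventForwarder) {
        this.eventForwarder = eventForwarder;
    }

    public void handle(Event event) {
        // sendNowSync blocks until Eventd has processed the event, so a slow consumer
        // throttles the caller and, in turn, the broker or listening socket.
        eventForwarder.sendNowSync(event);
    }
}
{code}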

Seth Leger February 1, 2017 at 10:49 AM
Reducing priority since we can mitigate this by bounding the Eventd queue.

Seth Leger December 14, 2016 at 10:44 AM
I added sizes and blockWhenFull behavior to all of the SEDA queues and executors in the processing chain, which solves the first of the two completion criteria listed in the December 5 comment below.
https://github.com/OpenNMS/opennms/pull/1198
commit 118afabead5ac9f22a8627c81391871fb9ca036c
There are still problems on OpenNMS if you use unbounded Eventd handler queues and I'm working on a fix for that inside .
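As an aside, one common way to get blocking behavior from a plain Java executor (not necessarily how the PR above does it) is to pair a bounded work queue with a rejection handler that makes the submitting thread wait for space; the pool and queue sizes below are arbitrary.
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorSketch {

    // Illustrative only: a fixed-size pool over a bounded queue whose rejection handler
    // blocks the producer instead of throwing or queueing without limit.
    public static ThreadPoolExecutor newBoundedExecutor() {
        return new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10_000),
                (task, pool) -> {
                    try {
                        pool.getQueue().put(task); // block until space frees up
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
    }
}
{code}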

Seth Leger December 5, 2016 at 3:37 PM
There are two criteria for completion:
1. If a Minion becomes disconnected from OpenNMS, incoming syslog and trap messages should not cause OutOfMemoryErrors. The queues should fill to a bounded size, and then new syslog and trap messages will be dropped at the network level.
2. If the rate of incoming messages into OpenNMS exceeds OpenNMS's processing speed, then OpenNMS should not fail with OutOfMemoryErrors and should provide back-pressure on the message broker by processing events synchronously.
If a large backlog of syslog or trap messages generated by a Minion is waiting on a Kafka broker and OpenNMS is started up, it will attempt to stream all of the messages at once into the Camel messaging system. This will exhaust all of the Java heap space and lead to an OutOfMemoryError if the number of messages in the backlog is too large.
To provide back-pressure on this queue and prevent memory from being exhausted, we should give the incoming Camel SEDA queue a bounded size and mark it as "blockWhenFull".
The queue will normally be close to empty because events are processed quickly after they are received, so the limit should only come into play when a significant backlog has accumulated in the messaging channel.
Since most syslog and trap messages are 1KB - 2KB in size, I would recommend a default queue size of 50,000, which should consume roughly 100MB of RAM under full load.
This needs to be done in 4 contexts:
blueprint-syslog-handler-default.xml
blueprint-syslog-handler-kafka-default.xml
blueprint-trapd-handler-default.xml
blueprint-trapd-handler-kafka-default.xml
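A minimal Camel Java DSL sketch of the size/blockWhenFull combination described above; the endpoint and bean names are made up for illustration, and the actual change belongs in the four blueprint XML files listed.
{code:java}
import org.apache.camel.builder.RouteBuilder;

public class BoundedSedaRouteSketch extends RouteBuilder {

    @Override
    public void configure() {
        // Producer side: block the sender when the bounded queue fills up, instead of
        // letting it grow without limit, so back-pressure reaches the messaging channel.
        from("direct:incomingSyslog")
            .to("seda:handleMessage?size=50000&blockWhenFull=true");

        // Consumer side: the same queue name and size, drained by the existing handler.
        from("seda:handleMessage?size=50000")
            .to("bean:syslogHandler");
    }
}
{code}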