Strange behavior on used threads and file descriptors on Minion
Description
Acceptance / Success Criteria
Attachments
Activity

Chandra Gorantla November 7, 2019 at 9:05 PM
Created NMS-12391 for limiting the number of threads at any given time and handling pending RPCs from the queue.

Chandra Gorantla November 7, 2019 at 3:45 PM
The file descriptors issue is not reproducible on release-25 or foundation-2019.
The number of threads on Minion depends entirely on the number of detectors present in the default foreign source and the number of interfaces on a node (each detector runs against each interface, so the work grows multiplicatively). Minion has no limit on the number of requests it can handle, and therefore no limit on the number of threads it can spawn. The only way to scale Minion is to add more Minions at a given location.
Closing this as Cannot Reproduce, since the main file descriptors issue is not reproducible on release-25.x.

Chandra Gorantla November 4, 2019 at 3:58 AM
As clarified in #comment-3, I also don't see any file descriptors issue on foundation-2019: file descriptor usage falls back to its previous constant value. Even thread usage seems to be limited in my experiments with more than 50 nodes. Maybe there are more detector failures in my Minion setup, which could have led to fewer threads at a given time. All RPCs received by Minion were completed.
Since there are three JMX detectors in the default foreign source definition without ipMatch, that seems to be the main reason for having more threads when JMX is involved. I don't see any obvious issue to fix here. File descriptor/thread usage falls back to its previous value, and all RPCs were completed by Minion.

Alejandro Galue October 31, 2019 at 7:41 PM
The numbers we're seeing on the graphs are legit.
I've created a couple of scripts to monitor the FDs and threads using /proc every second, for verification purposes.
Here is what I've used:
For File Descriptors
For Threads
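Purely as an illustration of that approach (not the actual scripts referenced above), a /proc-based sampler could look roughly like this; the pgrep pattern used to find the Minion JVM and the output format are assumptions:

#!/usr/bin/env bash
# Sketch: sample open FDs and live threads of the Minion JVM once per second.
# Run as root (or the minion user) so /proc/<pid>/fd is readable.
PID=$(pgrep -f -n minion)   # assumption: the Minion JVM command line matches "minion"
if [ -z "$PID" ]; then echo "Minion process not found" >&2; exit 1; fi
while true; do
  FDS=$(ls "/proc/$PID/fd" 2>/dev/null | wc -l)                  # open file descriptors
  THREADS=$(awk '/^Threads:/ {print $2}' "/proc/$PID/status")    # live threads
  echo "$(date '+%F %T') fds=$FDS threads=$THREADS"
  sleep 1
done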

Alejandro Galue October 31, 2019 at 7:27 PM
I repeated the tests with the latest snapshot of Horizon 25.1.0.
The behavior with file descriptors is different. Even if the amount of used FDs spikes while running detectors, it goes back to the stationary value.
Fortunately, that is not a problem in H25+.
Unfortunately, the number of used threads and even the number of FDs seem way higher. In terms of threads it is so high that the Minion VM becomes unresponsive for a while, I suspect due to high CPU usage while all those threads are running.
In terms of values, we're talking about importing 3 requisitions with about 50 nodes among them, using the default foreign source.
The number of FDs goes from around 520 to over 1500 on average. For threads it is worse: they go from about 220 to over 2000 on some occasions.
I removed all the JMX related detectors and repeated the tests, and the results are different.
Now the numbers don't increase in the same way as before, which makes me think that the main problem is the JMX detectors. The number of threads went from 220 to around 600, and the FDs from 520 to around 900, which is less but, in my personal opinion, still high. If this is the expected behavior and there is nothing we can do about it, I'd like an explanation I can use when customers ask.
In conclusion, having JMX detectors introduces bad side effects in terms of FD and thread usage on Minion.
However, considering we're talking about just 50 nodes in this test scenario, I'm worried about environments with hundreds of big switches behind a Minion, with hundreds or thousands of IP interfaces and multiple detectors to scan, even if JMX detectors are not involved.
There is NO increase in FDs or threads while doing normal polling and collection. The spikes appear ONLY when requesting multiple imports, or imports of big requisitions.
A customer has been experiencing problems with file descriptors on their Minion, leading to gaps in the graphs while the Minion tries to execute RPC requests (and send SNMP PDUs) with no FDs available.
Of course, declaring a big number for the MAX_FD environment variable in /etc/sysconfig/minion mitigates the problem. The reason I used the term "mitigate" instead of "solve" is that I found strange behavior in the used threads and the used FDs on Minion right after issuing a request to import a requisition (see the attached graphs).
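For illustration only, such a setting in /etc/sysconfig/minion could look like this (the value shown is just an example, not a recommendation):

# /etc/sysconfig/minion
MAX_FD=65536   # example value only; the test described below uses 1024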
The test environment used to reproduce the problem has the following settings:
1) A VM running OpenNMS Horizon 24.1.3 on CentOS 7.
2) A VM running OpenNMS Minion 24.1.3 on OL 7.6 (as this is the operating system that the customer uses, but I believe it is irrelevant for this discussion).
3) Configure data collection at a 30-second interval for all protocols and all data collections.
4) Use the embedded ActiveMQ for Sink and RPC communication.
5) Start OpenNMS and the Minion.
6) Wait until the Minion is discovered.
7) Add a few requisitions behind the location configured for the Minion in question.
The spikes on the used threads graphs correspond to processing an import request for one of the requisitions. This requisition has about 30 nodes.
The first 2 imports are using the default set of detectors (there is no default-foreign-source.xml). The third request has a modified list of detectors, where all those related to JMX (Minion, Kafka, Cassandra, if I recall correctly) were removed.
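As a side note, an import like these can be triggered on demand through the OpenNMS requisitions REST endpoint; the host, credentials, and requisition name here are placeholders:

curl -u admin:admin -X PUT "http://opennms.example.org:8980/opennms/rest/requisitions/Customer-Site/import"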
It looks like the JMX detectors have a huge impact on the number of threads required to finish the scan, compared with the first 2 spikes.
Fortunately, thread usage returns to a stable level after the requisition import is finished.
However, that doesn't seem to be the case for the file descriptors used. On the attached graph you can see the spikes, but after the import is done the count never returns to the original value, as if there were still file handles or sockets left open, leaving the Minion in a critical state: at some point the used FDs will reach the maximum value, leading to "Too many open files" errors.
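A quick way to check whether the leftover FDs are sockets or regular file handles is to look at what /proc reports for the Minion process; the pgrep pattern is an assumption about the local installation:

PID=$(pgrep -f -n minion)                             # assumption: the Minion JVM command line matches "minion"
TOTAL=$(ls "/proc/$PID/fd" | wc -l)                   # all open descriptors
SOCKETS=$(ls -l "/proc/$PID/fd" | grep -c 'socket:')  # descriptors pointing at sockets
echo "total FDs: $TOTAL, sockets: $SOCKETS"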
In this particular example, I've configured MAX_FD to be 1024, and that maximum absorbed the spikes above without issues.
The problem comes, in this particular case, if a fourth request to import the requisition comes in. If this happens, as mentioned, the "Too many open files" error appears in /opt/minion/logs/karaf.log, and also as the error reported by the failed RPC requests on the affected daemons within OpenNMS (collectd.log, poller.log, and provisiond.log). Just to be clear, OpenNMS has over 200000 FDs configured via opennms.conf, so those errors are not generated at OpenNMS; they are generated on Minion and are reflected as part of the failed requests attempted via RPC.
Without code changes, Minion would eventually have to be restarted on a schedule. Increasing MAX_FD will temporarily buy some headroom, but since the usage keeps increasing, the problem will eventually be triggered again.
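Purely as an illustration of such a schedule (assuming the systemd unit is named "minion"; the timing is arbitrary and this is a stopgap, not a fix):

# /etc/cron.d/restart-minion -- illustration only
0 3 * * 0  root  /usr/bin/systemctl restart minion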