A customer has been experiencing file descriptor exhaustion on their Minion, leading to holes in the graphs: the Minion tries to execute RPC requests (and send SNMP PDUs) when no FDs are available.
Of course, declaring a big number for the MAX_FD environment variable in /etc/sysconfig/minion mitigates the problem.
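For reference, the mitigation is a one-line setting in the Minion's sysconfig file; the value below is only an example, not a recommendation:

```shell
# /etc/sysconfig/minion
# Raise the Minion JVM's open-file limit (example value; tune for your load)
MAX_FD=65536
```

The Minion's start script passes this to ulimit before launching the JVM, so it must be raised here rather than only in the shell.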
The reason I used the term "mitigate" instead of "solve" is that I found strange behavior in the thread and FD usage on the Minion right after issuing a request to import a requisition (see the attached graphs).
The test environment used to reproduce the problem has the following settings:
1) A VM running OpenNMS Horizon 24.1.3 on CentOS 7.
2) A VM running OpenNMS Minion 24.1.3 on OL 7.6 (as this is the operating system that the customer uses, but I believe it is irrelevant for this discussion).
3) Configure data collection at a 30-second interval for all protocols and all data collections.
4) Use the embedded ActiveMQ for Sink and RPC communication.
5) Start OpenNMS and the Minion.
6) Wait until the Minion is discovered.
7) Add a few requisitions associated with the location configured for the Minion in question.
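To reproduce the thread/FD graphs without a full metrics pipeline, a small sampler over /proc is enough. This is a hypothetical helper, not part of OpenNMS; the pgrep pattern for the Minion's Karaf JVM is an assumption and may need adjusting:

```shell
#!/bin/sh
# sample_fds.sh - print open-FD and thread counts for a given PID (Linux /proc).
sample() {
  pid="$1"
  # Count entries in /proc/<pid>/fd (one symlink per open descriptor)
  fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  # Read the Threads: line from /proc/<pid>/status
  threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")
  echo "pid=$pid fds=$fds threads=$threads"
}

# Example usage (pattern is an assumption; verify against your process list):
# while sleep 30; do sample "$(pgrep -f 'karaf.*minion' | head -n1)"; done
```

Running this in a loop across an import request makes the spikes, and whether the FD count ever returns to baseline, directly visible.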
The spikes in the used-threads graph correspond to processing an import request for one of the requisitions. This requisition has about 30 nodes.
The first two imports use the default set of detectors (there is no default-foreign-source.xml). The third request has a modified list of detectors, where all those related to JMX (Minion, Kafka, and Cassandra, if I recall correctly) were removed.
It looks like the JMX detectors have a huge impact on the number of threads required to finish the scan, compared with the first two spikes.
Fortunately, thread usage returns to a stable level after the requisition import is finished.
However, that doesn't seem to be the case for file descriptors. The attached graph shows the spikes, but after the import is done the count never returns to its original value, as if file handles or sockets are still open. This leaves the Minion in a critical state, because at some point the used FDs will reach the maximum, leading to "Too many open files" errors.
In this particular example, I configured MAX_FD to be 1024, and that maximum was consumed without any errors reported up to that point.
The problem comes, in this particular case, if a fourth request to import the requisition arrives. If that happens, as mentioned, the "Too many open files" error will appear in /opt/minion/logs/karaf.log, and also as the error reported by the failed RPC requests on the affected daemons within OpenNMS (collectd.log, poller.log, and provisiond.log). Just to be clear, OpenNMS has over 200000 FDs configured via opennms.conf, so those errors are not generated by OpenNMS; they are generated on the Minion and are reflected as part of the failed requests attempted via RPC.
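To confirm whether the leaked descriptors are sockets, regular files, or pipes, summarizing lsof output by its TYPE column (field 5) is a quick check. This is a generic diagnostic sketch; the pgrep pattern for the Minion process is an assumption:

```shell
# Summarize the Minion's open FDs by type to see what is leaking.
# lsof's 5th column is TYPE (e.g. REG, IPv4, sock, FIFO); skip the header row.
lsof -p "$(pgrep -f 'karaf.*minion' | head -n1)" 2>/dev/null \
  | awk 'NR > 1 {count[$5]++} END {for (t in count) print t, count[t]}' \
  | sort -k2 -rn
```

Comparing this output before and after an import should show which FD type grows and never shrinks.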
Without code changes, the Minion would eventually have to be restarted on a schedule. Increasing MAX_FD temporarily adds some breathing room, but since the FD usage always increases, the problem will eventually be triggered again.