Collectd in OpenNMS 22.0.4 stops working when collecting SNMP data from ~10000 nodes through minions. At first, OpenNMS uses all available threads; then it rapidly fills the pending-tasks queue until it reaches ~12k tasks, where it stays permanently.
All non-minion collection (servers in the "Default" location) also ends up waiting for available Collectd threads, so even though some data is collected, most threads appear to be stuck waiting for something.
Pollerd seems to work just fine no matter what happens with the RPC SNMP queue, but that may simply be because Collectd requests far more data from these nodes than Pollerd ever does.
All the testing happens on a standby server running the exact same setup as the current production server, which is still on OpenNMS 20.1.0.
On OpenNMS 20.1.0, two minions and 250 Collectd threads are more than enough to reliably collect all the nodes and interfaces on a five-minute interval.
On the server running 22.0.4, I have tried 2000 threads and three minions, with no improvement.
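For reference, the Collectd pool size is the `threads` attribute in `etc/collectd-configuration.xml` (a sketch of the relevant fragment, with the package and collector definitions omitted):

```xml
<!-- etc/collectd-configuration.xml (fragment) -->
<!-- threads: size of the global Collectd scheduler pool.
     250 is enough on 20.1.0; even 2000 made no difference on 22.0.4. -->
<collectd-configuration threads="2000">
    <!-- <package> and <collector> definitions unchanged -->
</collectd-configuration>
```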
So far, I have tried a few different setups, in an attempt to rule out various factors:
- Collectd works perfectly after downgrading the "standby" server (and two of the minions) from 22.0.4 to 20.1.0. Active threads go up and back down (bottoms out at ~58 active threads) depending on activity, and the pending tasks queue is empty.
- Collectd works fine when I move all nodes to the "Default" location. Active threads follow the collection interval, and pending tasks queue is empty.
- Collectd works fine if I do some polling from various minion locations, but leave the bulk of the work to OpenNMS itself.
On the other hand, I have tried tweaking a lot of ActiveMQ settings, to no avail. All of these scenarios result in Collectd slowing to a crawl and exhausting all threads:
- OpenNMS' embedded ActiveMQ with no changes
- The embedded ActiveMQ with memoryUsage set to 1GB
- Standalone activemq-5.15.2 with our current settings from the 20.1.0 environment
- Standalone activemq-5.15.6 (currently the most recent stable release)
- Standalone activemq with flow control disabled, 500MB memory limit per queue, prefetch set to 10000, maxPageSize set to 2000, constantPendingMessageLimitStrategy set to 5000, and memoryUsage 2GB, storeUsage 2GB, tempUsage 1GB.
- Using one, two, or three minions
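For concreteness, the most aggressive standalone broker variant above corresponds roughly to a config like the following. This is a sketch, not our exact file; attribute and element names follow the standard ActiveMQ destination-policy and system-usage configuration, and note that `constantPendingMessageLimitStrategy` only takes effect on topics:

```xml
<!-- conf/activemq.xml (sketch of the most aggressive variant tested) -->
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="opennms">

    <destinationPolicy>
        <policyMap>
            <policyEntries>
                <!-- queues: flow control off, 500MB per queue,
                     prefetch 10000, maxPageSize 2000 -->
                <policyEntry queue=">" producerFlowControl="false"
                             memoryLimit="500mb" queuePrefetch="10000"
                             maxPageSize="2000"/>
                <!-- topics: cap pending messages per slow consumer at 5000 -->
                <policyEntry topic=">" producerFlowControl="false">
                    <pendingMessageLimitStrategy>
                        <constantPendingMessageLimitStrategy limit="5000"/>
                    </pendingMessageLimitStrategy>
                </policyEntry>
            </policyEntries>
        </policyMap>
    </destinationPolicy>

    <!-- broker-wide limits: memoryUsage 2GB, storeUsage 2GB, tempUsage 1GB -->
    <systemUsage>
        <systemUsage>
            <memoryUsage><memoryUsage limit="2 gb"/></memoryUsage>
            <storeUsage><storeUsage limit="2 gb"/></storeUsage>
            <tempUsage><tempUsage limit="1 gb"/></tempUsage>
        </systemUsage>
    </systemUsage>

    <transportConnectors>
        <transportConnector name="openwire" uri="tcp://0.0.0.0:61616"/>
    </transportConnectors>
</broker>
```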
None of these changes had any obvious effect, other than a lot more memory being used. The actual behaviour of OpenNMS and the minions seemingly didn't change much, if at all.
I have also confirmed that both the OpenNMS process and the minions have max open files properly set. (65000 or higher for all the relevant PIDs.)
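The limits were checked straight from /proc rather than trusting `ulimit` in a login shell, since the JVM may have been started under a different limit. A small sketch; substitute the real PIDs of the OpenNMS and Minion JVMs (e.g. via `pgrep`):

```shell
#!/bin/sh
# Print the effective "Max open files" limit for a given PID.
# Usage: ./check_fd_limit.sh <pid>
# Defaults to this shell's own PID for demonstration purposes.
PID="${1:-$$}"
grep "Max open files" "/proc/$PID/limits"
```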
Here are some graphs from when I "migrated" nodes from the Default location to a minion location. I have done this both by restarting and by simply reprovisioning and rescanning existing nodes, with the same effect.
Active threads vs. pool size:
Total tasks vs. completed tasks:
I have spent a lot of time digging through logs at various log levels, trying to find hints about the underlying cause, but since the problem only occurs when I put real load on the minions, I haven't been able to draw any conclusions so far.
The only thing I know for sure is that OpenNMS starts logging a lot of warnings to ipc.log as the queue starts filling up.
As seen in the graphs, I started having the minions collect data at around 01:25.
Between 00:00 and 00:59, almost no warnings were logged:
Between 02:00 and 02:45, ipc.log was being spammed with warnings about unknown correlationIDs:
Two examples of the full warning and error messages, in case that will help pinpoint the problem:
As mentioned, I uninstalled 22.0.4 and reinstalled 20.1.0 on this server plus the minions, importing the database from the production environment, and everything worked perfectly with the existing ActiveMQ config.
After upgrading to 22.0.4 again, I made sure to start from the pristine configs, changing only the files needed to collect data from and poll our nodes.