OpenNMS / NMS-10389

Unable to collect SNMP through minions on a large scale


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 22.0.4
    • Fix Version/s: 26.1.2
    • Security Level: Default (Default Security Scheme)
    • Labels:
      None
    • Environment:
      OpenNMS 22.0.4
      Minion(s)
      Debian 9.5
      ~265000 SNMP interfaces

      Description

      Collectd in OpenNMS 22.0.4 stops working when collecting SNMP from ~10000 nodes through minions. At first, OpenNMS uses all available threads; then the pending tasks queue fills up rapidly until it reaches ~12k tasks, where it stays permanently.

      All non-minion collection (servers in the "Default" location) also gets caught up waiting for available Collectd threads, so even though some data is collected, most threads seem to be stuck, waiting for something.

      Pollerd seems to work just fine, no matter what happens with the RPC SNMP queue, but that may simply be an effect of Collectd attempting to request a lot more data from these nodes than Pollerd ever does.
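      A thread dump of the OpenNMS JVM is the quickest way to see what the stuck Collectd threads are actually waiting on. A minimal sketch for producing a dump like the attached onms-stack.txt files, assuming $OPENNMS_PID holds the OpenNMS JVM's PID and that the pool threads carry a "Collectd-Thread" name prefix (check the dump for the actual naming):

      # Run as the same user as the OpenNMS JVM
      $ jstack -l "$OPENNMS_PID" > onms-stack.txt

      # Count the Collectd pool threads by state (RUNNABLE vs. WAITING/TIMED_WAITING)
      $ grep -A1 'Collectd-Thread' onms-stack.txt | grep -o 'java.lang.Thread.State: [A-Z_]*' | sort | uniq -c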

      All the testing happens on a standby server running the exact same setup as the current production server, which is still running OpenNMS 20.1.0.
      On OpenNMS 20.1.0, two minions and 250 Collectd threads are more than enough to reliably collect all the nodes and interfaces on a five minute interval.

      On the server running 22.0.4, I have tried with 2000 threads and three minions, with no improvement.
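      For reference, the Collectd pool size is the threads attribute at the top of collectd-configuration.xml; a quick way to confirm what a given server is actually configured with (a sketch, with $OPENNMS_HOME standing in for the install directory):

      $ grep -o 'threads="[0-9]*"' "$OPENNMS_HOME/etc/collectd-configuration.xml" | head -1
      threads="2000"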

      So far, I have tried a few different setups, in an attempt to rule out various factors:

      • Collectd works perfectly after downgrading the "standby" server (and two of the minions) from 22.0.4 to 20.1.0. Active threads go up and back down (bottoms out at ~58 active threads) depending on activity, and the pending tasks queue is empty.
      • Collectd works fine when I move all nodes to the "Default" location. Active threads follow the collection interval, and pending tasks queue is empty.
      • Collectd works fine if I do some polling from various minion locations, but leave the bulk of the work to OpenNMS itself.
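      One way to double-check how the nodes are spread across locations in the setups above is to count them per monitoring location in the database. A sketch, assuming the default PostgreSQL database name "opennms" and the node table's location column:

      $ psql -U opennms -d opennms -c "SELECT location, COUNT(*) FROM node GROUP BY location ORDER BY COUNT(*) DESC;"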

      On the other hand, I have tried tweaking a lot of ActiveMQ settings, to no avail. All of these scenarios result in Collectd slowing to a crawl and exhausting all threads:

      • OpenNMS' embedded ActiveMQ with no changes
      • The embedded ActiveMQ with memoryUsage set to 1GB
      • Standalone activemq-5.15.2 with our current settings from the 20.1.0 environment
      • Standalone activemq-5.15.6 (currently the most recent stable release)
      • Standalone activemq with flow control disabled, 500MB memory limit per queue, prefetch set to 10000, maxPageSize set to 2000, constantPendingMessageLimitStrategy set to 5000, and memoryUsage 2GB, storeUsage 2GB, tempUsage 1GB.
      • Using one minion, two minions or three minions

      None of these changes had any obvious effect, other than a lot more memory being used. The actual behaviour of OpenNMS and the minions seemingly didn't change much, if at all.
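      To see whether the RPC requests are actually piling up on the broker side in these scenarios, the standalone broker's queues can be inspected over JMX with the stock activemq CLI. A sketch against ActiveMQ 5.15.x (the attribute names are standard Queue MBean attributes, the queue name pattern matches the OpenNMS.<location>.RPC.* queues seen in ipc.log, and JMX must be reachable):

      $ "$ACTIVEMQ_HOME/bin/activemq" query -QQueue='OpenNMS.*' --view QueueSize,InFlightCount,EnqueueCount,DequeueCount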

      I have also confirmed that both the OpenNMS process and the minions have max open files properly set. (65000 or higher for all the relevant PIDs.)
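      A sketch of one way to verify that limit for the running JVMs, on the OpenNMS server and on each minion host (the pgrep patterns are assumptions and may need adjusting to the actual process names):

      $ for pid in $(pgrep -f opennms_bootstrap.jar) $(pgrep -f org.apache.karaf.main.Main); do
      >     echo "PID $pid:"
      >     grep 'Max open files' "/proc/$pid/limits"
      > done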

       

      Here are some graphs from when I "migrated" nodes from the Default location to a minion location. I have done this both with restarting and with simply reprovisioning and rescanning existing nodes, with the same effect.

      Active threads vs. pool size:

      Total tasks vs. completed tasks:

      Pending tasks:
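      The reprovision/rescan step mentioned above can be triggered through the requisitions ReST endpoint; a sketch, with hostname, credentials and requisition name as placeholders:

      $ curl -u admin:admin -X PUT \
          "http://opennms.example.org:8980/opennms/rest/requisitions/<requisition-name>/import?rescanExisting=true"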

      I have spent a lot of time digging through logs at various log levels, trying to find some hints about what the underlying cause is, but since the problem only occurs when I put real load on the minions, I haven't been able to come to any conclusions so far.

      The only thing I know for sure is that OpenNMS starts logging a lot of warnings to ipc.log as the queue starts filling up.

      As seen in the graphs, I started having the minions collect data around 01:25.

      Between 00:00 and 00:59, almost no warnings were logged:

      $ grep -c '2018-10-03 00.*Reply received for unknown correlationID' ipc.log 
      1
      $ grep -c '2018-10-03 00.*The OUT message was not received within' ipc.log 
      9
      

      Between 02:00 and 02:45, ipc.log was being spammed with warnings about unknown correlationIDs and reply timeouts:

      $ grep -c '2018-10-03 02.*Reply received for unknown correlationID' ipc.log 
      3229
      $ grep -c '2018-10-03 02.*The OUT message was not received within' ipc.log 
      3229
      
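      To see how the warning rate develops hour by hour, the same grep can be run in a loop (a sketch):

      $ for h in $(seq -w 0 23); do
      >     printf '%s: ' "$h"
      >     grep -c "2018-10-03 $h.*The OUT message was not received within" ipc.log
      > done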

      Two examples of the full warning and error messages, in case they help pinpoint the problem:

      2018-10-03 02:32:12,731 WARN  [Camel (rpcClient) thread #841 - JmsReplyManagerOnTimeout[OpenNMS.<minion-location>.RPC.SNMP]] o.a.c.c.j.r.TemporaryQueueReplyManager: Timeout occurred after 300000 millis waiting for reply message with correlationID [Camel-ID-<opennms-hostname>-46255-1538512560361-0-1133068] on destination temp-queue://ID:<opennms.hostname>-34049-1538512552976-1:9:1. Setting ExchangeTimedOutException on (MessageId: ID-<opennms-hostname>-46255-1538512560361-0-1142460 on ExchangeId: ID-<opennms-hostname>-46255-1538512560361-0-1133067) and continue routing.
      
      2018-10-03 02:32:12,731 ERROR [Camel (rpcClient) thread #841 - JmsReplyManagerOnTimeout[OpenNMS.<minion-location>.RPC.SNMP]] o.a.c.p.DefaultErrorHandler: Failed delivery for (MessageId: ID-<opennms-hostname>-46255-1538512560361-0-1142460 on ExchangeId: ID-<opennms-hostname>-46255-1538512560361-0-1133067). Exhausted after delivery attempt: 1 caught: org.apache.camel.ExchangeTimedOutException: The OUT message was not received within: 300000 millis due reply message with correlationID: Camel-ID-<opennms-hostname>-46255-1538512560361-0-1133068 not received on destination: temp-queue://ID:<opennms.hostname>-34049-1538512552976-1:9:1. Exchange[ID-<opennms-hostname>-46255-1538512560361-0-1133067]
      
      Message History
      ---------------------------------------------------------------------------------------------------------------------------------------
      RouteId              ProcessorId          Processor                                                                        Elapsed (ms)
      [rpcClient         ] [rpcClient         ] [direct://executeRpc                                                           ] [    301144]
      [rpcClient         ] [bean1             ] [bean[ref:camelRpcClientPreProcessor]                                          ] [       422]
      [rpcClient         ] [setExchangePattern] [setExchangePattern[InOut]                                                     ] [         0]
      [rpcClient         ] [recipientList1    ] [recipientList[simple{queuingservice:${header.JmsQueueName}}]                  ] [    300722]
      
      Stacktrace
      ---------------------------------------------------------------------------------------------------------------------------------------
      org.apache.camel.ExchangeTimedOutException: The OUT message was not received within: 300000 millis due reply message with correlationID: Camel-ID-<opennms-hostname>-46255-1538512560361-0-1133068 not received on destination: temp-queue://ID:<opennms.hostname>-34049-1538512552976-1:9:1. Exchange[ID-<opennms-hostname>-46255-1538512560361-0-1133067]
              at org.apache.camel.component.jms.reply.ReplyManagerSupport.processReply(ReplyManagerSupport.java:170) [camel-jms-2.19.1.jar:2.19.1]
              at org.apache.camel.component.jms.reply.TemporaryQueueReplyHandler.onTimeout(TemporaryQueueReplyHandler.java:62) [camel-jms-2.19.1.jar:2.19.1]
              at org.apache.camel.component.jms.reply.CorrelationTimeoutMap$1.run(CorrelationTimeoutMap.java:60) [camel-jms-2.19.1.jar:2.19.1]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
      
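      To check whether these timeouts are confined to a single minion location or spread across all of the RPC queues, the queue name embedded in the Camel thread name can be tallied (a sketch, based on the JmsReplyManagerOnTimeout[...] pattern in the warnings above):

      $ grep 'The OUT message was not received within' ipc.log \
      >   | sed -n 's/.*JmsReplyManagerOnTimeout\[\(OpenNMS\.[^]]*\)\].*/\1/p' \
      >   | sort | uniq -c | sort -rn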

      As mentioned, I uninstalled 22.0.4 and reinstalled 20.1.0 on this server plus the minions, importing the database from the production environment, and everything worked perfectly with the existing ActiveMQ config.
      After upgrading to 22.0.4 again, I made sure to use the pristine configs, only changing the files needed to collect data from and poll our nodes.

        Attachments

        1. 20181003_collectd_queue.png (18 kB)
        2. 20181003_collectd_tasks.png (36 kB)
        3. 20181003_collectd_threads.png (22 kB)
        4. 20181004_activemq_external.png (37 kB)
        5. 20181004_onms-stack.txt (3.14 MB)
        6. 20181004_onms-stack-while-provisioning.txt (3.06 MB)
        7. 20181004_pending_tasks.png (24 kB)
        8. 20181004_tasks_completed.png (36 kB)
        9. 20181004_tasks_external.png (61 kB)
        10. 20181010_collectd_pending_tasks.png (18 kB)
        11. 20181010_collectd_threads.png (50 kB)

            People

            Assignee:
            Unassigned
            Reporter:
            Brynjar Eide (brynjar)
            Votes:
            2
            Watchers:
            7

              Dates

              Created:
              Updated:
              Resolved: