Uploaded image for project: 'OpenNMS'
  1. OpenNMS
  2. NMS-9413

Collectd is scheduling collection for every IP address instead of just primary interfaces

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 20.0.0
    • Fix Version/s: 20.0.1
    • Component/s: Data Collection - SNMP
    • Security Level: Default (Default Security Scheme)
    • Labels:
      None

      Description

      On a test lab, I was testing Horizon 20 as I usually do when there is a new release, and I discovered the following problem:

      OpenNMS is scheduling SNMP data collection for every single IP/service pair that matches a collection package regardless of the isSnmpPrimary flag on the ipInterface table.

      The above is true no matter if the nodes are behind a Minion or directly managed by OpenNMS.

      This might affect other services, but I've only noted this on SNMP, which is the one that for sure should use the concept of the primary interface.

      This never happened before, as until H19, the SNMP Collector (and Collectd in general) was respecting the proper scheduling of the services using the primary IP address.

      1. collectd.log.gz
        421 kB
        Alejandro Galue
      2. horizon20.pgdump
        225 kB
        Alejandro Galue
      3. horizon20.tar.gz
        2.37 MB
        Alejandro Galue

        Issue Links

          Activity

          Hide
          agalue Alejandro Galue added a comment - - edited

          There are 10 nodes on my database:

          • One Minion
          • 8 Cisco Routers from my GNS3 Lab
          • The OpenNMS server itself

          The total amount of IP interfaces on the database is 25 (including the secondary IPs discovered through SNMP from the routers).

          [root@onms-stable logs]# grep "collection, scheduled" collectd.log.init  | wc -l
          22
          
          [root@onms-stable logs]# psql -U opennms -d opennms -c "select count(*) from ipinterface where issnmpprimary='P'"
           count 
          -------
               9
          (1 row)
          
          [root@onms-stable logs]# psql -U opennms -d opennms -c "select count(*) from ipinterface"
           count 
          -------
              25
          (1 row)
          

          collection.log.init contains only the initialization logs with DEBUG (i.e. when Collectd is starting and scheduling everything). I've created a copy because the amount of errors is so big that the logs rolls out quickly.

          I tried the above putting all the Cisco routers behind the Minion, and then started from scratch ignoring the Minion (i.e. collecting everything from OpenNMS directly).

          Show
          agalue Alejandro Galue added a comment - - edited There are 10 nodes on my database: One Minion 8 Cisco Routers from my GNS3 Lab The OpenNMS server itself The total amount of IP interfaces on the database is 25 (including the secondary IPs discovered through SNMP from the routers). [root@onms-stable logs]# grep "collection, scheduled" collectd.log.init | wc -l 22 [root@onms-stable logs]# psql -U opennms -d opennms -c "select count(*) from ipinterface where issnmpprimary='P'" count ------- 9 (1 row) [root@onms-stable logs]# psql -U opennms -d opennms -c "select count(*) from ipinterface" count ------- 25 (1 row) collection.log.init contains only the initialization logs with DEBUG (i.e. when Collectd is starting and scheduling everything). I've created a copy because the amount of errors is so big that the logs rolls out quickly. I tried the above putting all the Cisco routers behind the Minion, and then started from scratch ignoring the Minion (i.e. collecting everything from OpenNMS directly).
          Hide
          agalue Alejandro Galue added a comment -

          I also discover that when a node is unreachable, not necessarily I get a dataCollectionSucceeded when the node is reachable again.

          In other words, there are going to be zombie dataCollectionFailed alarms and you have to manually clear them (which never happened before).

          Show
          agalue Alejandro Galue added a comment - I also discover that when a node is unreachable, not necessarily I get a dataCollectionSucceeded when the node is reachable again. In other words, there are going to be zombie dataCollectionFailed alarms and you have to manually clear them (which never happened before).
          Hide
          agalue Alejandro Galue added a comment -

          Besides the obvious waste of precious resources in Collectd, I'm seeing LOTs of errors in collectd.log that ends up with:

          2017-06-09 14:52:13,449 ERROR [Collectd-Thread-32-of-50] o.o.n.c.CollectableService: Unexpected error during node SNMP collection for: 100.0.0.1
          org.opennms.netmgt.collection.api.CollectionException: Unexpected error during node SNMP collection for: 100.0.0.1
                  at org.opennms.netmgt.collectd.SnmpCollector.collect(SnmpCollector.java:261) ~[opennms-services-20.0.0.jar:?]
                  at org.opennms.netmgt.collection.client.rpc.CollectorClientRpcModule$1.get(CollectorClientRpcModule.java:83) ~[org.opennms.features.collection.client-rpc-20.0.0.jar:?]
                  at org.opennms.netmgt.collection.client.rpc.CollectorClientRpcModule$1.get(CollectorClientRpcModule.java:77) ~[org.opennms.features.collection.client-rpc-20.0.0.jar:?]
                  at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_131]
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
                  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
          Caused by: org.opennms.netmgt.collection.api.CollectionInitializationException: Interface 100.0.0.1 is not the primary SNMP interface for nodeid 14
                  at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateIsSnmpPrimary(DefaultCollectionAgent.java:301) ~[opennms-services-20.0.0.jar:?]
                  at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateAgent(DefaultCollectionAgent.java:324) ~[opennms-services-20.0.0.jar:?]
                  at org.opennms.netmgt.collectd.SnmpCollector.collect(SnmpCollector.java:220) ~[opennms-services-20.0.0.jar:?]
                  ... 6 more
          

          The above errors is shown on every collection attempt, and that's related with the schedulers associated with non-primary IPs, and those messages are filling out collectd.log quickly.

          Show
          agalue Alejandro Galue added a comment - Besides the obvious waste of precious resources in Collectd, I'm seeing LOTs of errors in collectd.log that ends up with: 2017-06-09 14:52:13,449 ERROR [Collectd- Thread -32-of-50] o.o.n.c.CollectableService: Unexpected error during node SNMP collection for : 100.0.0.1 org.opennms.netmgt.collection.api.CollectionException: Unexpected error during node SNMP collection for : 100.0.0.1 at org.opennms.netmgt.collectd.SnmpCollector.collect(SnmpCollector.java:261) ~[opennms-services-20.0.0.jar:?] at org.opennms.netmgt.collection.client.rpc.CollectorClientRpcModule$1.get(CollectorClientRpcModule.java:83) ~[org.opennms.features.collection.client-rpc-20.0.0.jar:?] at org.opennms.netmgt.collection.client.rpc.CollectorClientRpcModule$1.get(CollectorClientRpcModule.java:77) ~[org.opennms.features.collection.client-rpc-20.0.0.jar:?] at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131] at java.lang. Thread .run( Thread .java:748) [?:1.8.0_131] Caused by: org.opennms.netmgt.collection.api.CollectionInitializationException: Interface 100.0.0.1 is not the primary SNMP interface for nodeid 14 at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateIsSnmpPrimary(DefaultCollectionAgent.java:301) ~[opennms-services-20.0.0.jar:?] at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateAgent(DefaultCollectionAgent.java:324) ~[opennms-services-20.0.0.jar:?] at org.opennms.netmgt.collectd.SnmpCollector.collect(SnmpCollector.java:220) ~[opennms-services-20.0.0.jar:?] ... 6 more The above errors is shown on every collection attempt, and that's related with the schedulers associated with non-primary IPs, and those messages are filling out collectd.log quickly.
          Hide
          agalue Alejandro Galue added a comment -

          If I check the Instrumentation Logger Report, you can easily see that there are more than 9 different entries (based on my lab), and some of them (as expected) are related with the same node, which can confirm the suspicions about the incorrect scheduling process.

          Show
          agalue Alejandro Galue added a comment - If I check the Instrumentation Logger Report, you can easily see that there are more than 9 different entries (based on my lab), and some of them (as expected) are related with the same node, which can confirm the suspicions about the incorrect scheduling process.
          Hide
          j-white Jesse White added a comment -

          Collectd attempts to schedule all interfaces that match the package definition.

          Before 20.0.0, a call to ServiceCollector#initialize(CollectionAgent agent, Map<String, Object> parameters) was made prior to scheduling the interface. Aside from performing any interface specific initialization, this also allowed the ServiceCollector to "reject" the scheduling by throwing a CollectionInitializationException.

          In 20.0.0 while refactoring the ServiceCollector interface to work with Minion, we removed the call to initialize() while scheduling the interface.

          This change caused the observed behavior, which is that all IP interfaces will be scheduled, and a exception will be thrown (and an error will be logged) when collection is attempted using the SnmpCollector on a non-primary interface.

          Show
          j-white Jesse White added a comment - Collectd attempts to schedule all interfaces that match the package definition. Before 20.0.0, a call to ServiceCollector#initialize(CollectionAgent agent, Map<String, Object> parameters) was made prior to scheduling the interface. Aside from performing any interface specific initialization, this also allowed the ServiceCollector to "reject" the scheduling by throwing a CollectionInitializationException. In 20.0.0 while refactoring the ServiceCollector interface to work with Minion, we removed the call to initialize() while scheduling the interface. This change caused the observed behavior, which is that all IP interfaces will be scheduled, and a exception will be thrown (and an error will be logged) when collection is attempted using the SnmpCollector on a non-primary interface.
          Show
          j-white Jesse White added a comment - PR: https://github.com/OpenNMS/opennms/pull/1544

            People

            • Assignee:
              j-white Jesse White
              Reporter:
              agalue Alejandro Galue
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development