OpenNMS / NMS-6603

Pollerd stops working if a service is down and the downtime model is not correct.

    Details

      Description

      While working with a customer, I found that Pollerd was not doing anything: nothing was being monitored on the customer's side.

      Digging through the logs, I found the following in uncategorized.log:

      2014-06-16 14:01:09,614 ERROR [Main] OpenNMS.Poller.org.opennms.netmgt.poller.Poller: start: Failed to schedule existing interfaces
      java.lang.RuntimeException: Downtime model is invalid, cannot schedule service 2153:213.187.42.60:BLAH_BLAH
         at org.opennms.netmgt.poller.pollables.PollableServiceConfig.getInterval(PollableServiceConfig.java:222)
         at org.opennms.netmgt.scheduler.Schedule.adjustSchedule(Schedule.java:142)
         at org.opennms.netmgt.poller.pollables.PollableService.updateStatus(PollableService.java:306)
         at org.opennms.netmgt.poller.Poller.scheduleService(Poller.java:591)
         at org.opennms.netmgt.poller.Poller.access$100(Poller.java:72)
         at org.opennms.netmgt.poller.Poller$2.processRow(Poller.java:517)
         at org.opennms.core.utils.Querier.executeStmt(Querier.java:90)
         at org.opennms.core.utils.JDBCTemplate.doExecute(JDBCTemplate.java:96)
         at org.opennms.core.utils.JDBCTemplate.execute(JDBCTemplate.java:68)
         at org.opennms.netmgt.poller.Poller.scheduleMatchingServices(Poller.java:524)
         at org.opennms.netmgt.poller.Poller.scheduleExistingServices(Poller.java:436)
         at org.opennms.netmgt.poller.Poller.onInit(Poller.java:261)
      

      Here is how the package is configured:

      <package name="Blah Blah" remote="false">
             <filter>IPADDR != '0.0.0.0'</filter>
             <include-range begin="1.1.1.1" end="254.254.254.254"/>
             <rrd step="300">
                 <rra>RRA:AVERAGE:0.5:1:2016</rra>
                 <rra>RRA:AVERAGE:0.5:12:1488</rra>
                 <rra>RRA:AVERAGE:0.5:288:366</rra>
                 <rra>RRA:MAX:0.5:288:366</rra>
                 <rra>RRA:MIN:0.5:288:366</rra>
             </rrd>
             <service name="BLAH_BLAH" interval="300000"
                 user-defined="true" status="on">
      ...
             </service>
             <downtime begin="300000" end="43200000" interval="300000"/>
             <downtime begin="43200000" end="432000000" interval="600000"/>
      </package>
      

      NOTE: I've renamed the service and the package to protect the customer's data.

      As you can see, no entry says how the service should be polled as soon as it is detected as down (i.e. there is no downtime entry with begin="0"). With this model, a service that has been down for less than 300000 ms matches no downtime entry at all.

      The problem is caused by a RuntimeException thrown by the getInterval method of org.opennms.netmgt.poller.pollables.PollableServiceConfig:
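
A gap like this can be detected before Pollerd starts. The following is only a minimal sketch of such a check, not OpenNMS code: the class DowntimeCoverageCheck and its Window type are hypothetical stand-ins for the parsed downtime entries, and the check assumes an entry can match only when its begin is less than or equal to the time the service has been down.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNMS: reports how long a freshly
// failed service would go without any matching <downtime> entry.
final class DowntimeCoverageCheck {

    /** Hypothetical stand-in for one <downtime> entry; only begin matters here. */
    static final class Window {
        final long beginMs;

        Window(long beginMs) {
            this.beginMs = beginMs;
        }
    }

    /**
     * Returns the smallest begin value in the model, in milliseconds.
     * A result greater than 0 means outages shorter than that value match
     * no entry. An empty model returns Long.MAX_VALUE (nothing ever matches).
     */
    static long uncoveredMillisAtStart(List<Window> model) {
        long earliest = Long.MAX_VALUE;
        for (Window w : model) {
            earliest = Math.min(earliest, w.beginMs);
        }
        return earliest;
    }

    public static void main(String[] args) {
        // The two begin values from the package shown above.
        List<Window> reported = Arrays.asList(
                new Window(300_000L), new Window(43_200_000L));
        System.out.println(uncoveredMillisAtStart(reported)); // prints 300000

        // The same model with an additional entry starting at begin="0".
        List<Window> repaired = Arrays.asList(
                new Window(0L), new Window(300_000L), new Window(43_200_000L));
        System.out.println(uncoveredMillisAtStart(repaired)); // prints 0
    }
}
```

A non-zero result means the package needs an entry with begin="0" to be complete.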

          public synchronized long getInterval() {
      
              if (m_service.isDeleted())
                  return -1;
      
              long when = m_configService.getInterval();
      
              if (m_service.getStatus().isDown()) {
                  long downSince = m_timer.getCurrentTime() - m_service.getStatusChangeTime();
                  boolean matched = false;
                  for (Downtime dt : m_pkg.getDowntimeCollection()) {
                      if (dt.getBegin() <= downSince) {
                          if (dt.getDelete() != null && (dt.getDelete().equals("yes") || dt.getDelete().equals("true"))) {
                              when = -1;
                              matched = true;
                          }
                          else if (dt.hasEnd() && dt.getEnd() > downSince) {
                              // in this interval
                              //
                              when = dt.getInterval();
                              matched = true;
                          } else // no end
                          {
                              when = dt.getInterval();
                              matched = true;
                          }
                      }
                  }
                  if (!matched) {
                      ThreadCategory.getInstance(getClass()).warn("getInterval: Could not locate downtime model, throwing runtime exception");
                      throw new RuntimeException("Downtime model is invalid, cannot schedule service " + m_service);
                  }
              }
      
              if (when < 0) {
                  m_service.sendDeleteEvent();
              }
      
              return when;
          }
      

      Instead of throwing a RuntimeException, we should return a value that makes sense (for example, the configured monitoring interval) when this happens, and write an error message to poller.log (not uncategorized.log) to let the administrator know about the issue. Alternatively, modify the method to throw a known exception that must be caught if something is wrong.

      The current behavior is that none of the 11000+ services defined across more than 100 packages is being monitored because of an error in the downtime model of a single package.

      In other words, a silly error in one package blocks the scheduling of all the services defined in the database, which is definitely something that should not happen.
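
The first option could look roughly like the sketch below. It is a standalone illustration, not a patch: FallbackIntervalSketch and Window are hypothetical stand-ins for PollableServiceConfig and Downtime, java.util.logging stands in for the poller log, and the delete-event handling of the real method is left out. The two non-delete branches of the original loop are collapsed into one because both assign the same interval.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch of the proposed behavior, not the OpenNMS implementation.
final class FallbackIntervalSketch {

    // Assumption: a logger named "poller" stands in for poller.log.
    private static final Logger LOG = Logger.getLogger("poller");

    /** Hypothetical stand-in for one <downtime> entry. */
    static final class Window {
        final long beginMs;
        final long intervalMs;
        final boolean delete;

        Window(long beginMs, long intervalMs, boolean delete) {
            this.beginMs = beginMs;
            this.intervalMs = intervalMs;
            this.delete = delete;
        }
    }

    /**
     * Returns the polling interval for a service that has been down for
     * downSinceMs. When no entry matches, logs an error and keeps the
     * configured interval instead of throwing.
     */
    static long intervalForDownService(String service, long configuredIntervalMs,
                                       long downSinceMs, List<Window> model) {
        long when = configuredIntervalMs;
        boolean matched = false;
        for (Window w : model) {
            if (w.beginMs <= downSinceMs) {
                // As in the quoted loop, the last matching entry wins.
                when = w.delete ? -1 : w.intervalMs;
                matched = true;
            }
        }
        if (!matched) {
            LOG.severe("getInterval: no downtime entry matches service " + service
                    + "; falling back to the configured interval of "
                    + configuredIntervalMs + " ms");
        }
        return when;
    }

    public static void main(String[] args) {
        List<Window> reported = Arrays.asList(
                new Window(300_000L, 300_000L, false),
                new Window(43_200_000L, 600_000L, false));
        // Down for one second: nothing matches, so the configured interval is kept.
        System.out.println(intervalForDownService("BLAH_BLAH", 300_000L, 1_000L, reported));
    }
}
```

With this shape, an incomplete downtime model degrades to normal-rate polling for the affected service and leaves a clear message for the administrator.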
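
The stack trace above suggests that the exception escapes the loop that schedules each row, which would explain why one bad package stops everything. One way to contain it, shown here only as a generic sketch, is to catch failures per service and carry on. IsolatedSchedulingSketch and scheduleAll are hypothetical names, and the Consumer stands in for whatever schedules a single service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not OpenNMS code: schedule every service and
// collect the ones that fail instead of aborting on the first error.
final class IsolatedSchedulingSketch {

    /**
     * Applies scheduler to each service. A RuntimeException from one
     * service is recorded and does not prevent the rest from being scheduled.
     * Returns the services that could not be scheduled.
     */
    static <T> List<T> scheduleAll(List<T> services, Consumer<T> scheduler) {
        List<T> failed = new ArrayList<>();
        for (T service : services) {
            try {
                scheduler.accept(service);
            } catch (RuntimeException e) {
                // A real implementation would also log the cause here.
                failed.add(service);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> scheduled = new ArrayList<>();
        List<String> failed = scheduleAll(
                Arrays.asList("svc-a", "svc-bad", "svc-c"),
                s -> {
                    if (s.equals("svc-bad")) {
                        throw new RuntimeException("Downtime model is invalid");
                    }
                    scheduled.add(s);
                });
        System.out.println(scheduled); // prints [svc-a, svc-c]
        System.out.println(failed);    // prints [svc-bad]
    }
}
```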

            People

            • Assignee: Alejandro Galue (agalue)
            • Reporter: Alejandro Galue (agalue)