Pollerd stops working if a service is down and the downtime model is not correct.
Description
I was working with a customer and found that Pollerd was not doing anything: nothing was being monitored on the customer's side.
Digging through the logs, I found the following in uncategorized.log:
Here is how the package is configured:
NOTE: I've renamed the service and the package to protect the customer's data.
As you can see, there is no entry that says how the service should be polled as soon as it is detected as down (i.e. a downtime entry with begin="0").
The problem is caused by a RuntimeException thrown by the getInterval method of org.opennms.netmgt.poller.pollables.PollableServiceConfig:
Instead of throwing a RuntimeException, getInterval should return a value that makes sense when this happens (for example, the configured monitoring interval) and write an error message to poller.log (not uncategorized.log) so the administrator knows about the issue. Alternatively, the method could throw a checked exception that callers must catch when something is wrong.
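The fallback proposed above could look roughly like the sketch below. It is not the actual PollableServiceConfig code: DowntimeEntry, DowntimeIntervalSketch, intervalFor and all of their parameters are hypothetical names used only for illustration, and the downtime model is reduced to begin/end/interval values in milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for one downtime entry of a package; not the OpenNMS class.
final class DowntimeEntry {
    final long beginMs;    // outage age (ms) at which this entry starts to apply
    final long endMs;      // outage age (ms) at which it stops applying (exclusive)
    final long intervalMs; // polling interval to use while this entry applies

    DowntimeEntry(long beginMs, long endMs, long intervalMs) {
        this.beginMs = beginMs;
        this.endMs = endMs;
        this.intervalMs = intervalMs;
    }

    boolean covers(long downForMs) {
        return downForMs >= beginMs && downForMs < endMs;
    }
}

final class DowntimeIntervalSketch {
    // Assumption: a logger that ends up in poller.log; the name is illustrative.
    private static final Logger LOG = Logger.getLogger("poller");

    // Returns the interval of the first entry covering the outage age.
    // If no entry matches (e.g. nothing starts at begin="0"), logs an error
    // and falls back to the normal interval instead of throwing.
    static long intervalFor(List<DowntimeEntry> model, long downForMs,
                            long defaultIntervalMs, String pkg, String svc) {
        for (DowntimeEntry entry : model) {
            if (entry.covers(downForMs)) {
                return entry.intervalMs;
            }
        }
        LOG.severe("Downtime model of package " + pkg + " has no entry for service "
                + svc + " down for " + downForMs + " ms; using default interval "
                + defaultIntervalMs + " ms");
        return defaultIntervalMs;
    }

    public static void main(String[] args) {
        // A model with a gap: nothing covers the first five minutes of an outage.
        List<DowntimeEntry> model =
                Arrays.asList(new DowntimeEntry(300_000L, 43_200_000L, 600_000L));
        System.out.println(intervalFor(model, 10_000L, 300_000L, "example1", "ICMP"));
        System.out.println(intervalFor(model, 400_000L, 300_000L, "example1", "ICMP"));
    }
}
```

With this shape, a gap in the model costs one log line per lookup rather than an unchecked exception that escapes into the scheduler.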
The current behavior is that none of the 11000+ services defined across more than 100 packages are being monitored, because of an error in the downtime model of a single package.
In other words, a trivial error in one package blocks the scheduling of every service defined in the database, which should definitely not happen.
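One way to keep a single bad package from blocking everything is to contain the failure per service while scheduling. The sketch below only illustrates that idea; IsolatedSchedulingSketch, scheduleAll and the intervalLookup callback are hypothetical and do not reflect how Pollerd schedules services or how the issue was actually fixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

final class IsolatedSchedulingSketch {
    // Tries to schedule every service. A service whose interval lookup throws
    // is recorded in 'failed' and skipped, so the remaining services are still
    // scheduled. Returns "service@intervalMs" for each scheduled service.
    static List<String> scheduleAll(List<String> services,
                                    ToLongFunction<String> intervalLookup,
                                    List<String> failed) {
        List<String> scheduled = new ArrayList<>();
        for (String service : services) {
            try {
                long intervalMs = intervalLookup.applyAsLong(service);
                scheduled.add(service + "@" + intervalMs);
            } catch (RuntimeException ex) {
                // In a real daemon this would also be logged to poller.log.
                failed.add(service);
            }
        }
        return scheduled;
    }
}
```

The caller can then report the failed services to the administrator while the rest of the packages keep being polled.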
Activity
Alejandro Galue June 30, 2014 at 11:10 AM
Fixed in revision 6cc7f4f80cf376c3e2315c9aa58612996bc2ac2a for 1.12.