The downtime model never removes nodes when instructed to do so
Description
Typically, the default downtime model for Pollerd is configured to remove a node that has been down for more than 5 days.
Of course, this option makes more sense for discovered nodes than requisitioned nodes.
I've created a simple testing environment on a VM running 14.0.1 with the following configuration changes:
1) Change the polling frequency to be 30 seconds for all the services in poller-configuration.xml
2) Remove all the downtime model entries and add the following:
BTW, a "valid" downtime model must start with 0 (i.e., begin="0"). It may contain several intermediate entries (each with begin/end/interval, where the "begin" attribute of each entry must match the "end" attribute of the entry above it), and the last entry must either delete the node (as in the example) or keep checking the service at a fixed interval (i.e., an entry with "begin" and "interval" but no "end" or "delete"). If these rules are not followed, the downtime model is rejected and ignored.
3) Start OpenNMS.
4) Add a new node through the newSuspect event (i.e., discover a node). I've used another VM as a target node.
5) Wait a few minutes to verify that the node is being monitored properly.
6) Stop the VM that is being monitored.
7) Wait more than 5 minutes.
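A downtime model following the rules above might look like the fragment below inside a package in poller-configuration.xml. The intervals are illustrative (milliseconds), not the exact values used in this test; note that each entry's "begin" matches the previous entry's "end", and the last entry deletes the entity:

```xml
<!-- Illustrative downtime model (goes inside a <package> element).
     All times are in milliseconds. -->
<downtime begin="0" end="300000" interval="30000"/>      <!-- first 5 min: poll every 30s -->
<downtime begin="300000" end="600000" interval="60000"/> <!-- next 5 min: poll every 60s -->
<downtime begin="600000" delete="true"/>                 <!-- after 10 min: delete -->
```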
Expected result:
The node should be removed automatically from the database.
Current result:
The node is still in the DB after 15 minutes (it is never removed, or marked to be removed). All the monitored services have been removed as part of the downtime model, but the empty IP interface and the node itself are never removed from the DB (see the screenshot).
In other words, it is partially working.
Also, the time between the nodeDown and the service deletion exceeds 5 minutes for some services, which is not expected either: some services are requested to be removed 5 minutes after the nodeDown, but the rest are requested to be removed 10 minutes after it.
Acceptance / Success Criteria
None
Attachments
Activity
Alejandro Galue December 10, 2014 at 10:35 AM
Despite the fact that the downtime model counters are reset when you restart OpenNMS, the core problem of this issue has been resolved.
Alejandro Galue December 9, 2014 at 6:11 PM
Yes, it works
Interesting fact: if a node has been down long enough to be deleted by the downtime model, and you then realize you forgot to set enableDeletionOfRequisitionedEntities to true, I would expect that changing the flag and restarting would delete the node immediately. Unfortunately, that is not the case: you must wait the whole interval again before the node is deleted by the downtime model; the time the node has already been down does not count. I understand why that happens, but I'm not sure users would understand it as well. Still, this is better than never deleting it, I guess.
Alejandro Galue December 9, 2014 at 5:42 PM
Good point!
Trying that now.
Benjamin Reed December 9, 2014 at 5:40 PM
Did you set org.opennms.provisiond.enableDeletionOfRequisitionedEntities=true in opennms.properties? It won't delete anything that's defined in a requisition if that's false.
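For reference, the flag Ben mentions is set in opennms.properties (under the OpenNMS etc directory on a standard install):

```properties
# Allow Provisiond to delete entities defined in a requisition
# when the downtime model requests a deletion.
org.opennms.provisiond.enableDeletionOfRequisitionedEntities=true
```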
Alejandro Galue December 8, 2014 at 5:49 PM
Ben, I've recompiled the branch again (after 03d35b064858c962d85c585bf66fc8066d26c3c7), and I was able to validate that the discovered node is deleted because of the downtime model.
But the requisitioned node remains untouched despite the downtime model.