Provisiond temporarily deletes policy-based surveillance categories from existing nodes when synchronizing

Description

I make heavy use of surveillance categories, and I primarily use provisioning group policies to control category membership on nodes.

A common workflow for me is to add a new node to an existing provisioning group that already contains nodes. When I do this, all of the policy-defined category memberships of the nodes already in the group disappear while provisiond goes through some kind of rescan/reimport process on those nodes. Because the scan and import take a while to complete, an existing node can be missing most of its categories for several minutes. This is a problem for anything that uses those categories (and I use a lot of category-based filters).

I have also noticed that per-node category memberships defined in the requisition, rather than via a group-wide foreign source policy, are unaffected.

Steps to reproduce:

  • Create two surveillance categories, "TestPolicyCategory" and "TestRequisitionCategory"

  • Create the following provisioning group "TestGroup" and synchronize it:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<foreign-source date-stamp="2011-11-17T12:52:38.522-06:00" name="TestGroup" xmlns="http://xmlns.opennms.org/xsd/config/foreign-source">
  <scan-interval>1d</scan-interval>
  <detectors>
    <detector class="org.opennms.netmgt.provision.detector.icmp.IcmpDetector" name="ICMP"/>
    <detector class="org.opennms.netmgt.provision.detector.snmp.SnmpDetector" name="SNMP"/>
  </detectors>
  <policies>
    <policy class="org.opennms.netmgt.provision.persist.policies.NodeCategorySettingPolicy" name="SetCategory">
      <parameter value="TestPolicyCategory" key="category"/>
      <parameter value="ALL_PARAMETERS" key="matchBehavior"/>
    </policy>
  </policies>
</foreign-source>

  • Find a node that takes a little while to scan (SNMP-enabled Windows boxes with SNMP Informant seem to work well in our environment). Use the "Add Node" link in the WebUI to provision this node as "TestNode1" into the provisioning group "TestGroup". Assign the additional category "TestRequisitionCategory" to the node via the dropdowns available through this UI.

  • Go to the node page of the newly provisioned node TestNode1, and refresh until the Surveillance Category Memberships section shows both "TestRequisitionCategory" and "TestPolicyCategory".

  • Add another node to the test provisioning group (simply hitting synchronize again may also be enough).

  • (BUG) Hit refresh on the node page of TestNode1 and watch as TestPolicyCategory disappears for the duration of the provisiond scan/import process (starting with the nodeUpdated event), then reappears once the nodeScanCompleted event occurs. However, TestRequisitionCategory is unaffected.

Expected behavior would be for the node to retain its existing policy-based category memberships during the scan/import process, and have any adjustments take place only once the process is complete.
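To make the expected behavior concrete, here is a minimal sketch of the idea: categories found during a scan are staged on the side, and the set that readers see is replaced only once the scan completes. The class and method names (`StagedCategories`, `begin_scan`, `stage`, `complete_scan`) are hypothetical and are not Provisiond's API.

```python
class StagedCategories:
    """Illustrative only: keeps the visible category set stable during a scan."""

    def __init__(self, current):
        self.current = set(current)  # what category-based filters would see
        self._staged = None          # categories collected by the scan in progress

    def begin_scan(self):
        # Start collecting into a separate set; `current` is left untouched.
        self._staged = set()

    def stage(self, category):
        if self._staged is None:
            raise RuntimeError("no scan in progress")
        self._staged.add(category)

    def complete_scan(self):
        if self._staged is None:
            raise RuntimeError("no scan in progress")
        # Swap in the new set in one step, only after the scan has finished.
        self.current, self._staged = self._staged, None


node = StagedCategories({"TestPolicyCategory", "TestRequisitionCategory"})
node.begin_scan()
node.stage("TestRequisitionCategory")  # requisition categories arrive first
visible_mid_scan = set(node.current)   # still shows both categories
node.stage("TestPolicyCategory")       # policy categories arrive later
node.complete_scan()
```

With this shape, a filter that reads `node.current` in the middle of the scan still sees TestPolicyCategory.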

Environment

CentOS 5.6, Oracle JDK 7u0

Acceptance / Success Criteria

None

Activity

Benjamin Reed September 22, 2014 at 10:08 PM

Provisiond has been fixed to not delete and add categories in phases, which repairs the issues with categories being removed and re-added.

For details on the new design, see:

https://github.com/OpenNMS/opennms/blob/rc/stable/1.14.0/opennms-provision/opennms-provisiond/design.markdown#category-lifecycle

Benjamin Reed September 19, 2014 at 1:00 PM

Because these things happen in 2 passes (the requisition processing, and then potentially much later, a node scan) there's a race condition between the creation of requisitioned categories on a node and the final accounting that happens after policies are applied during the node scan.

The solution is to "keep state" through the process so we know the difference between categories that are applied from the requisition import/scan and categories that come from elsewhere.

I have a fix for this and am working on finishing up tests now.
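A rough sketch of what "keeping state" could look like, assuming each node records which categories came from provisioning (requisition or policy) separately from categories assigned elsewhere. This is an illustration under that assumption, not the actual fix; `NodeCategoryState` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class NodeCategoryState:
    """Illustrative only: tracks where each category on a node came from."""

    provisioned: set = field(default_factory=set)  # from the requisition or a policy
    other: set = field(default_factory=set)        # assigned elsewhere, e.g. by hand

    def finish_scan(self, seen_this_run):
        """Final accounting after the node scan.

        Only provisioned categories that were not seen again in this run are
        dropped; categories from elsewhere are never touched.
        """
        seen = set(seen_this_run)
        removed = self.provisioned - seen
        self.provisioned = seen
        return removed

    def effective(self):
        # The full set of categories the node belongs to.
        return self.provisioned | self.other


state = NodeCategoryState(
    provisioned={"TestPolicyCategory", "StalePolicyCategory"},
    other={"ManualCategory"},
)
removed = state.finish_scan({"TestPolicyCategory", "TestRequisitionCategory"})
```

Here `removed` contains only `StalePolicyCategory`, and `ManualCategory` survives because it was never attributed to provisioning.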

Alexander Hoogerhuis April 17, 2014 at 8:52 AM

In my simplistic "end user brain", would this not be just a case of comparing the set of existing categories and the new categories, and then just chucking out those that are not in the new set?
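The comparison described here can be sketched as a plain set difference. This is only an illustration of the idea with a hypothetical helper, `diff_categories`, not Provisiond code.

```python
def diff_categories(existing, desired):
    """Return (to_add, to_remove) needed to turn `existing` into `desired`."""
    existing, desired = set(existing), set(desired)
    return desired - existing, existing - desired


to_add, to_remove = diff_categories(
    existing={"TestPolicyCategory", "TestRequisitionCategory", "OldPolicyCategory"},
    desired={"TestPolicyCategory", "TestRequisitionCategory"},
)
```

One caveat: taken literally, this also throws out any category that was assigned outside of provisioning, since nothing in the two sets says where a category came from.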

Benjamin Reed April 17, 2014 at 8:47 AM

Yeah, you're right. This is another thing that's just not solvable without fixing Provisiond to handle node "deltas".

I've reverted the commit for now.

Alexander Hoogerhuis April 17, 2014 at 12:13 AM

I just tested the code in 1.12 with b1ec50f07b0e5f0eda95130db727cd4550e5d54f added, and it now keeps adding labels, but does not remove labels that are no longer present in a provisioning policy.

Fixed

Created November 17, 2011 at 2:04 PM
Updated September 22, 2014 at 10:08 PM
Resolved September 22, 2014 at 10:08 PM