If a target node is rebooted, the RRD/JRB files contain spikes because the sysUpTime check is not working
Description
Several years ago, the SNMP Collector was refactored to track the sysUpTime change on each collection attempt. That functionality was integrated with another feature which is disabled by default, so the sysUpTime check is not being performed.
That feature sends a forceRescan event when a change is detected in ifNumber or sysUpTime, and for some reason it has been disabled.
The sysUpTime validation must be executed for the SNMP Collector regardless of whether the forceRescan feature is enabled, so that persistence and threshold processing can be skipped for counter variables on the collection cycle where the reset is detected.
Because this is not happening right now, the counter variables may show big spikes due to counter resets, and the threshold processor may generate invalid alarms.
Acceptance / Success Criteria
Attachments
Activity
Alejandro Galue February 6, 2017 at 11:53 AM
The PR was merged, and I manually ported the solution into foundation for Meridian-2015 on c3da2aa47e6a6aecd3fec8b604ab4b7fe5959597.
Alejandro Galue February 1, 2017 at 12:48 PM
Actually, the JUnit test wasn't that bad ....
@Test
public void testCounterReset_NMS7106() throws Exception {
    // Create Resource
    SnmpCollectionAgent agent = createCollectionAgent();
    SnmpCollectionResource resource = createNodeResource(agent);
    Date baseDate = new Date();

    // Add Counter Attribute
    String attributeName = "myCounter";
    String attributeId = "node[1].resourceType[node].instance[null].metric[" + attributeName + "]";
    Map<String, CollectionAttribute> attributes = new HashMap<String, CollectionAttribute>();
    BigInteger initialValue = new BigInteger("300");
    SnmpAttribute attribute = addAttributeToCollectionResource(resource, attributeName, "counter", "0", initialValue);
    attributes.put(attribute.getName(), attribute);

    // Get counter value - first time
    CollectionResourceWrapper wrapper = createWrapper(resource, attributes, baseDate);
    Assert.assertFalse(CollectionResourceWrapper.s_cache.containsKey(attributeId));
    Assert.assertEquals(Double.valueOf(Double.NaN), wrapper.getAttributeValue(attributeName)); // Last value is null
    Assert.assertEquals(Double.valueOf(Double.NaN), wrapper.getAttributeValue(attributeName)); // Last value is null
    Assert.assertEquals(Double.valueOf(initialValue.doubleValue()), CollectionResourceWrapper.s_cache.get(attributeId).getValue());
    Assert.assertTrue(wrapper.getAttributeValue(attributeName).isNaN());

    // Increase counter
    attribute = addAttributeToCollectionResource(resource, attributeName, "counter", "0", new BigInteger("600"));
    attributes.put(attribute.getName(), attribute);
    wrapper = createWrapper(resource, attributes, new Date(baseDate.getTime() + 300000));
    Assert.assertFalse(CollectionResourceWrapper.s_cache.get(attributeId).getValue().isNaN());
    Assert.assertEquals(Double.valueOf(300.0), CollectionResourceWrapper.s_cache.get(attributeId).getValue());
    Assert.assertEquals(Double.valueOf(1.0), wrapper.getAttributeValue(attributeName)); // (600 - 300) / 300 = 1.0

    // Increase counter again
    attribute = addAttributeToCollectionResource(resource, attributeName, "counter", "0", new BigInteger("900"));
    attributes.put(attribute.getName(), attribute);
    wrapper = createWrapper(resource, attributes, new Date(baseDate.getTime() + 600000));
    Assert.assertFalse(CollectionResourceWrapper.s_cache.get(attributeId).getValue().isNaN());
    Assert.assertEquals(Double.valueOf(600.0), CollectionResourceWrapper.s_cache.get(attributeId).getValue());
    Assert.assertEquals(Double.valueOf(1.0), wrapper.getAttributeValue(attributeName)); // (900 - 600) / 300 = 1.0

    // Emulate a sysUpTime restart
    attribute = addAttributeToCollectionResource(resource, attributeName, "counter", "0", new BigInteger("60"));
    attributes.put(attribute.getName(), attribute);
    wrapper = createWrapper(resource, attributes, new Date(baseDate.getTime() + 900000));
    wrapper.setCounterReset(true);
    Assert.assertTrue(wrapper.getAttributeValue(attributeName).isNaN());

    // Increase counter again
    attribute = addAttributeToCollectionResource(resource, attributeName, "counter", "0", new BigInteger("120"));
    attributes.put(attribute.getName(), attribute);
    wrapper = createWrapper(resource, attributes, new Date(baseDate.getTime() + 1200000));
    Assert.assertEquals(Double.valueOf(0.2), wrapper.getAttributeValue(attributeName)); // (120 - 60) / 300 = 0.2

    EasyMock.verify(agent);
}
Without code changes, the test will fail as I explained, and as you've figured out. With the fix, the test passes.
Here is the fix:
diff --git a/opennms-services/src/main/java/org/opennms/netmgt/threshd/CollectionResourceWrapper.java b/opennms-services/src/main/java/org/opennms/netmgt/threshd/CollectionResourceWrapper.java
index 76535ed..2bf46a6 100644
--- a/opennms-services/src/main/java/org/opennms/netmgt/threshd/CollectionResourceWrapper.java
+++ b/opennms-services/src/main/java/org/opennms/netmgt/threshd/CollectionResourceWrapper.java
@@ -421,6 +421,9 @@ public class CollectionResourceWrapper {
         LOG.debug("getCounterValue: id={}, last={}, current={}", id, (last==null ? last : last.m_value +"@"+ last.m_timestamp), current);
         if (last == null) {
             m_localCache.put(id, Double.NaN);
+            if (m_counterReset) {
+                s_cache.put(id, new CacheEntry(m_collectionTimestamp, current));
+            }
             LOG.info("getCounterValue: unknown last value for {}, ignoring current", id);
         } else {
             Double delta = current.doubleValue() - last.m_value.doubleValue();
I'm going to create a PR soon; at least the rest of the threshold tests are not complaining.
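To make the intent of that change easier to follow, here is a minimal, standalone sketch of the counter-rate bookkeeping it aims for. This is not the real CollectionResourceWrapper code; the class, method, and cache names are illustrative only, and the sample values mirror the JUnit test above.
import java.util.HashMap;
import java.util.Map;

// Simplified, standalone sketch; names do not match the real CollectionResourceWrapper API.
public class CounterRateSketch {

    private final Map<String, Double> cache = new HashMap<>();

    // Returns the rate for a counter sample, or NaN when no rate can be computed
    // (first sample ever, or the sample on which a counter reset was detected).
    public double getRate(String id, double current, long intervalSeconds, boolean counterReset) {
        Double last = cache.get(id);
        if (last == null || counterReset) {
            // No usable baseline: report NaN so thresholding ignores this cycle, but
            // remember the current value (this is the point of the fix) so the next
            // cycle computes its delta from the post-reset baseline, not a stale one.
            cache.put(id, current);
            return Double.NaN;
        }
        cache.put(id, current);
        return (current - last) / intervalSeconds;
    }

    public static void main(String[] args) {
        CounterRateSketch sketch = new CounterRateSketch();
        String id = "myCounter";
        System.out.println(sketch.getRate(id, 300, 300, false)); // NaN (first sample)
        System.out.println(sketch.getRate(id, 600, 300, false)); // 1.0
        System.out.println(sketch.getRate(id, 900, 300, false)); // 1.0
        System.out.println(sketch.getRate(id, 60, 300, true));   // NaN (counter reset detected)
        System.out.println(sketch.getRate(id, 120, 300, false)); // 0.2 instead of a wrap-adjusted spike
    }
}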
Alejandro Galue February 1, 2017 at 11:46 AM
I think I figured out where the problem is. Here is my debug process:
1) Identify that the sysUpTime change was detected by the collector:
sysUpTime change detected
2017-02-01 11:25:21,691 DEBUG [Collectd-Thread-28-of-50] o.o.n.c.SnmpCollectionSet: collect: nodeId: 27 interface: 192.168.254.99 sysUpTime: 7066 savedSysUpTime: 59288
2) Verify that the persister won't save the data to the RRD files:
skip persistence
2017-02-01 11:25:21,693 DEBUG [Collectd-Thread-28-of-50] o.o.n.c.a.AbstractPersister: Persisting node[27].interfaceSnmp[Fa0_1-ca01070b0006].ifInOctets [.1.3.6.1.2.1.2.2.1.10] = 3555 . Ignoring value because of sysUpTime changed.
2017-02-01 11:25:21,693 DEBUG [Collectd-Thread-28-of-50] o.o.n.c.a.AbstractPersister: Persisting node[27].interfaceSnmp[Fa0_1-ca01070b0006].ifInErrors [.1.3.6.1.2.1.2.2.1.14] = 0 . Ignoring value because of sysUpTime changed.
2017-02-01 11:25:21,693 DEBUG [Collectd-Thread-28-of-50] o.o.n.c.a.AbstractPersister: Persisting node[27].interfaceSnmp[Fa0_1-ca01070b0006].ifOutOctets [.1.3.6.1.2.1.2.2.1.16] = 25237 . Ignoring value because of sysUpTime changed.
...
3) On the same collection attempt (i.e. 11:25:xx), check what is happening with the thresholds (not the next attempt). This is the timestamp at which the counters should be restarted, and the thresholds will be evaluated against NaN (because the rate cannot be calculated as expected):
counters restarted
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: applyThresholds: Applying thresholds on node[27].interfaceSnmp[Fa0_1-ca01070b0006] using 7 attributes.
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: applying 1 filters to resource node[27].interfaceSnmp[Fa0_1-ca01070b0006]
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: filter #1: field=ifHighSpeed, regex='^[1-9]+[0-9]*$'
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.CollectionResourceWrapper: getLabelValue: Getting Value for if::ifHighSpeed
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: the value of ifHighSpeed is 100. Pass filter? true
2017-02-01 11:25:21,780 INFO [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: applyThresholds: Processing threshold ifOutOctets * 8 / 1000000 / ifHighSpeed * 100 : {evaluator=high, dsName=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, dsType=if, evaluators=[{ds=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, value=90.0, rearm=75.0, trigger=3}]} on resource node[27].interfaceSnmp[Fa0_1-ca01070b0006]
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.CollectionResourceWrapper: getCounterValue: id=node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifOutOctets], last=null, current=25237.0
2017-02-01 11:25:21,780 INFO [Collectd-Thread-28-of-50] o.o.n.t.CollectionResourceWrapper: getCounterValue: unknown last value for node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifOutOctets], ignoring current
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.CollectionResourceWrapper: getAttributeValue: id=node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifHighSpeed], value= 100.0
2017-02-01 11:25:21,780 INFO [Collectd-Thread-28-of-50] o.o.n.t.ThresholdingSet: applyThresholds: All attributes found for node[27].interfaceSnmp[Fa0_1-ca01070b0006], evaluating
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdEntity: evaluate: value= NaN against threshold: {evaluator=high, dsName=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, dsType=if, evaluators=[{ds=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, value=90.0, rearm=75.0, trigger=3}]}
2017-02-01 11:25:21,780 DEBUG [Collectd-Thread-28-of-50] o.o.n.t.ThresholdEvaluatorHighLow: evaluate: resetting high threshold count to 0
Note that the "current" value for ifIOutOctets
is 25237.0
, and the last value was NaN
as expected (so 25237.0
should be now the current value on the cache).
4) Identify the next collection attempt (I'm collecting every 30 seconds):
next collection attempt
2017-02-01 11:25:53,324 DEBUG [Collectd-Thread-29-of-50] o.o.n.c.SnmpCollectionSet: collect: nodeId: 27 interface: 192.168.254.99 sysUpTime: 10231 savedSysUpTime: -1
Note that the saved sysUpTime is -1, so this is indeed the next attempt after the restart was detected.
5) On this collection attempt, check the threshold processing for the same chosen metric:
threshold processing after restart
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: applyThresholds: Applying thresholds on node[27].interfaceSnmp[Fa0_1-ca01070b0006] using 7 attributes.
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: applying 1 filters to resource node[27].interfaceSnmp[Fa0_1-ca01070b0006]
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: filter #1: field=ifHighSpeed, regex='^[1-9]+[0-9]*$'
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.CollectionResourceWrapper: getLabelValue: Getting Value for if::ifHighSpeed
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: passedThresholdFilters: the value of ifHighSpeed is 100. Pass filter? true
2017-02-01 11:25:53,377 INFO [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: applyThresholds: Processing threshold ifOutOctets * 8 / 1000000 / ifHighSpeed * 100 : {evaluator=high, dsName=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, dsType=if, evaluators=[{ds=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, value=90.0, rearm=75.0, trigger=3}]} on resource node[27].interfaceSnmp[Fa0_1-ca01070b0006]
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.CollectionResourceWrapper: getCounterValue: id=node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifOutOctets], last=412769.0@Wed Feb 01 11:23:41 EST 2017, current=48084.0
2017-02-01 11:25:53,377 INFO [Collectd-Thread-29-of-50] o.o.n.t.CollectionResourceWrapper: getCounterValue: node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifOutOctets](counter) wrapped counter adjusted last=412769.0@2017-02-01T11:23:41.425-0500, current=48084.0, olddelta=-364685.0, newdelta=4.294602611E9
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.CollectionResourceWrapper: getCounterValue: id=node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifOutOctets], value=3.30354047E7, delta=4.294602611E9, interval=130
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.CollectionResourceWrapper: getAttributeValue: id=node[27].resourceType[if].instance[Fa0_1-ca01070b0006].metric[ifHighSpeed], value= 100.0
2017-02-01 11:25:53,377 INFO [Collectd-Thread-29-of-50] o.o.n.t.ThresholdingSet: applyThresholds: All attributes found for node[27].interfaceSnmp[Fa0_1-ca01070b0006], evaluating
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdEntity: evaluate: value= 264.2832376 against threshold: {evaluator=high, dsName=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, dsType=if, evaluators=[{ds=ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, value=90.0, rearm=75.0, trigger=3}]}
2017-02-01 11:25:53,377 DEBUG [Collectd-Thread-29-of-50] o.o.n.t.ThresholdEvaluatorHighLow: evaluate: high threshold exceeded, count=1
For some reason, the last value is not what I was expecting. In the logs it says last=412769.0@Wed Feb 01 11:23:41 EST 2017. The timestamp is correct, but I was expecting 25237.0, and that's the problem.
In other words, the cache doesn't seem to be properly updated after a "counter reset" is detected, but at least the logic for triggering the "counter reset" is correct.
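For reference, the bogus value in step 5 follows directly from that stale cache entry: the raw delta is negative, so it is treated as a 32-bit counter wrap and 2^32 is added. A quick sketch of that arithmetic, using the numbers from the log above (the class name is just a placeholder):
// Arithmetic behind the spike in step 5, using the values from the log above.
public class WrapAdjustmentSketch {
    public static void main(String[] args) {
        double last = 412769.0;      // stale pre-reset value left in the cache
        double current = 48084.0;    // first sample after the counter reset
        double oldDelta = current - last;                    // -364685.0
        double newDelta = oldDelta + Math.pow(2, 32);        // 4.294602611E9 (treated as a 32-bit wrap)
        double intervalSeconds = 130;
        double rate = newDelta / intervalSeconds;            // 3.30354047E7 octets/s
        // ifOutOctets * 8 / 1000000 / ifHighSpeed * 100, with ifHighSpeed = 100
        double expression = rate * 8 / 1000000 / 100 * 100;  // ~264.28, evaluated against a 90.0 threshold
        System.out.println(oldDelta + " / " + newDelta + " / " + rate + " / " + expression);
    }
}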
I'll try to reproduce this in a JUnit test, which is more challenging.
Alejandro Galue February 1, 2017 at 10:58 AM
There is something missing from your logs. Probably you don't have the logs available, but as soon as you see the problem again, please share collectd.log for the collector thread in question.
When you see "Ignoring value because of sysUpTime changed" from the persister, you should see, at the same time, something like "getCounterValue: unknown last value for XXX, ignoring current" from the threshold processor. That is what triggers the counter reset, and the threshold evaluation should be skipped for that cycle (because the last value is considered NaN).
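For completeness, this is why returning NaN is enough to keep a threshold from firing on that cycle: NaN propagates through the whole expression, and a NaN comparison is never true. A tiny illustration using the expression shown in the logs (the class name is just a placeholder):
// Why a NaN counter value keeps the high threshold from firing on the reset cycle.
public class NanThresholdSketch {
    public static void main(String[] args) {
        double ifOutOctetsRate = Double.NaN; // counter ignored on the reset cycle
        double ifHighSpeed = 100.0;
        double value = ifOutOctetsRate * 8 / 1000000 / ifHighSpeed * 100;
        System.out.println(value);        // NaN: it propagates through the expression
        System.out.println(value > 90.0); // false: a NaN comparison never exceeds the threshold
    }
}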
Details
Assignee
Alejandro Galue
Reporter
Alejandro Galue
Components
Priority
Major
