Details
Bug
Status: Resolved
Major
Resolution: Fixed
27.0.0
Security Level: Default (Default Security Scheme)
None
Horizon 2020 - Nov 24-Dec 9, Horizon 2020 - Dec 9 -Dec 23, Horizon 2020 - Dec 23 - Jan 6, Horizon 2020 - Jan 6 - Jan 20, Horizon 2021 - Jan 20 - Feb 3, Horizon 2021 - Feb 3 - Feb 17, Horizon 2021 - Feb 17 - Mar 3
Description
Symptom
Attributes in the Newts search index start disappearing once OpenNMS has been running for longer than the configured TTL (org.opennms.newts.config.ttl).
As a result, at seemingly random intervals (depending on when a node was added and first scanned successfully), graphs stop working and OpenNMS starts returning 404 errors through the API or "There is no data for this resource" in the Web UI.
Cause
When running OpenNMS with Newts as the timeseries strategy, metrics that are being actively collected will eventually vanish from the REST API and the Web UI, because newts.resource_attributes has an identical TTL to the newts.samples table.
Samples are correctly expired after the TTL hits 0, but even if OpenNMS has recently collected and inserted new samples, the search index will gradually expire as well. Once the TTL of resource_attributes reaches 0, OpenNMS will still continue inserting new samples, but the REST API and the Web UI will no longer be aware that they exist.
While the samples are continuously inserted at every collection interval, and as such will always contain data with a fresh TTL, resource_attributes are only ever inserted once, and the TTL is never updated, even when rebuilding Newts' MetadataCache after a restart.
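The mismatch can be illustrated with a small shell model. This is a sketch of the behaviour described above, not code taken from OpenNMS or Newts: the helper name remaining_ttl is hypothetical, and the 900 second TTL and 60 second interval simply mirror the reproduction settings used in this report.

```shell
# Toy model of row expiry (hypothetical helper, not OpenNMS code).
# Arguments: time the row was written, current time, TTL (all in seconds).
# Prints the seconds left before the row expires, or 0 once it has expired.
remaining_ttl() {
  _left=$(( $1 + $3 - $2 ))
  if [ "$_left" -gt 0 ]; then echo "$_left"; else echo 0; fi
}

ttl=900        # org.opennms.newts.config.ttl, in seconds
interval=60    # collection interval, in seconds

# The attribute row is written once at t=0 and never refreshed.
# The newest sample is written at every collection interval, so it
# always carries a fresh TTL.
for now in 0 300 840 900 960; do
  last_sample=$(( now / interval * interval ))
  echo "t=${now}s attribute_ttl=$(remaining_ttl 0 "$now" "$ttl") newest_sample_ttl=$(remaining_ttl "$last_sample" "$now" "$ttl")"
done
```

In this model the attribute countdown reaches 0 at t=900s while the newest sample still has the full 900 seconds left, which matches the symptom: data keeps arriving, but the index entry that makes it discoverable is gone.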
No obvious workaround
Once the TTL of entries in resource_attributes expires, the only way to get OpenNMS to rebuild the search index seems to involve a restart of OpenNMS.
Unfortunately, since the TTL of existing entries in resource_attributes isn't updated on a restart, only attributes that have already expired will be recreated. Because of this, immediately after restarting OpenNMS to fix the broken metrics, other attributes may continue to disappear due to the TTL expiring, depending on when OpenNMS originally started collecting them (or when they expired and OpenNMS was restarted last).
There do seem to be a few exceptions to this TTL behaviour: attributes related to telemetryAdapters, sinkConsumerMetrics and sinkProducerMetrics apparently get recreated immediately after the TTL expires.
As far as I can tell, the TTL for resource_attributes currently can't be configured separately from the samples TTL, so there's no easy way to disable it or set it to a TTL far enough in the future that this is unlikely to ever happen:
The only other alternative seems to be to manually truncate the resource_attributes table once the first attributes start disappearing, thereby forcing OpenNMS to recreate the entire search index with a fresh TTL. This would set them all to the same TTL again, as opposed to having to restart OpenNMS every time a new attribute becomes unavailable.
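A minimal sketch of the manual truncate, assuming cqlsh is installed and the Cassandra node is reachable on 127.0.0.1 (both are assumptions, as are the helper names and the APPLY and CASSANDRA_HOST variables). It only prints the command unless APPLY=1 is set, since truncating removes the whole search index at once.

```shell
# Hypothetical helper: builds the CQL statement for the manual workaround.
# The keyspace defaults to "newts", the name used in this report.
truncate_statement() {
  echo "TRUNCATE ${1:-newts}.resource_attributes;"
}

# Dry run by default: print the command instead of executing it.
# Set APPLY=1 to actually run it against CASSANDRA_HOST (assumed to be
# 127.0.0.1 when unset).
run_truncate() {
  if [ "${APPLY:-0}" = "1" ]; then
    cqlsh "${CASSANDRA_HOST:-127.0.0.1}" -e "$(truncate_statement)"
  else
    echo "would run: cqlsh ${CASSANDRA_HOST:-127.0.0.1} -e \"$(truncate_statement)\""
  fi
}

run_truncate
```

Whether OpenNMS repopulates the table without a restart afterwards is not established here; given the restart behaviour described above, a restart may still be needed.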
How to reproduce
After installing / compiling OpenNMS from scratch, it's possible to reproduce this issue reliably by tweaking the default configs a bit.
- Changes to intervals in collectd and pollerd, to make the issue more visible:
sed -i 's/interval="300000"/interval="60000"/' poller-configuration.xml collectd-configuration.xml
- Changes to opennms.properties:
# $ grep -hv ^# opennms.properties.d/*.properties
org.opennms.timeseries.strategy=newts
org.opennms.newts.config.ttl=900
org.opennms.newts.query.minimum_step=60000
org.opennms.newts.query.heartbeat=90000
org.opennms.web.defaultGraphPeriod=last_1_hour
- Manually add a node (or run discovery on 127.0.0.1) to start collecting data
- Graphs should now be available and continue updating every minute for roughly 15 minutes on http://127.0.0.1:8980/opennms/graph/results.htm?resourceId=node[1].nodeSnmp[]&reports=all
- After 15 minutes, all graphs will spontaneously disappear and OpenNMS will report "There is no data for this resource" until the next restart.
After a restart, the graphs should be back again, and OpenNMS will have continued to insert new samples. Other than the time it takes to restart OpenNMS, no data should have been lost, regardless of how long OpenNMS reported "There is no data for this resource".
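While reproducing the issue, it may help to watch the remaining TTL on the search index rows directly. The sketch below only prints a cqlsh command. The column names (context, resource, attribute, value) are an assumption about the Newts schema and are not confirmed by this report, so check them with DESCRIBE TABLE newts.resource_attributes first. The helper name ttl_query and the 127.0.0.1 host are likewise hypothetical.

```shell
# Hypothetical helper: builds a CQL query that shows the seconds left
# before each search index row expires. Column names are assumed, not
# taken from this report.
# Arguments: keyspace (default "newts"), row limit (default 20).
ttl_query() {
  echo "SELECT context, resource, attribute, TTL(value) FROM ${1:-newts}.resource_attributes LIMIT ${2:-20};"
}

# Print the command to run against a local Cassandra node (assumed host).
echo "cqlsh 127.0.0.1 -e \"$(ttl_query)\""
```

With the 900 second TTL from the steps above, the reported values would be expected to count down towards 0 and not reset while collection continues.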