Actively collected metrics suddenly become unavailable through API and Web UI due to static TTL on Newts search index
Description
Acceptance / Success Criteria
Activity

Brynjar Eide February 18, 2021 at 3:32 PM
Brilliant! I bumped my cache entries' TTL to something like 10 years as a temporary workaround, but I look forward to testing your fix.
Thanks for fixing the bug so quickly, and for the heads up about the fixed version being available.

fooker February 17, 2021 at 3:27 PM
I managed to fix both the caching and the priming in the newts project.
If you want to give this a try, you can exchange /opt/opennms/lib/newts-cassandra-search-1.5.2.jar
with the updated version from the repo (I can also provide a build if required).

Jesse White February 10, 2021 at 2:48 PM
In OpenNMS we use a different implementation of the cache, so we would need to consider updating that one as well:
https://github.com/OpenNMS/opennms/blob/opennms-27.0.5-1/features/newts/src/main/java/org/opennms/netmgt/newts/support/GuavaSearchableResourceMetadataCache.java
We also prime the cache when OpenNMS is restarted in order to avoid an influx of writes to the cluster:
https://github.com/OpenNMS/newts/blob/1.5.3/cassandra/search/src/main/java/org/opennms/newts/cassandra/search/CassandraCachePrimer.java
The priming would also need to be updated to consider the remaining TTL so that the re-insert would be performed when needed.
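Roughly the idea, as a sketch with hypothetical names (not the actual Newts/OpenNMS cache or primer API): treat a primed or cached index entry as valid only while its remaining TTL stays above a safety margin, and re-insert it otherwise.

import java.time.Duration;
import java.time.Instant;

public class TtlAwareIndexEntry {
    private final Instant expiresAt;      // when the Cassandra row's TTL runs out
    private final Duration safetyMargin;  // re-insert once less than this remains

    public TtlAwareIndexEntry(Instant expiresAt, Duration safetyMargin) {
        this.expiresAt = expiresAt;
        this.safetyMargin = safetyMargin;
    }

    // True while the index row can still be relied on without re-inserting it.
    public boolean isStillFresh(Instant now) {
        return Duration.between(now, expiresAt).compareTo(safetyMargin) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Only one hour of TTL left, but a 12-hour margin: the primer should re-insert.
        TtlAwareIndexEntry entry =
                new TtlAwareIndexEntry(now.plus(Duration.ofHours(1)), Duration.ofHours(12));
        System.out.println("needs re-insert: " + !entry.isStillFresh(now));
    }
}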

fooker February 10, 2021 at 8:56 AM
That is an awesome bug report and description. Thank you!
The fix above needs some additional testing.
Details
Assignee: fooker
Reporter: Brynjar Eide
Components:
Sprint: None
Fix versions:
Affects versions:
Priority: Major
Symptom
Attributes in the Newts search index start disappearing after OpenNMS has been running for longer than the configured TTL (org.opennms.newts.config.ttl). Due to this, at seemingly random intervals (depending on when a node was added and successfully scanned for the first time), graphs will stop working and OpenNMS will start returning 404 errors through the API or "There is no data for this resource" in the Web UI.

Cause
When running OpenNMS with Newts as the timeseries strategy, metrics that are being actively collected will eventually vanish from the REST API and the Web UI, due to resource_attributes having an identical TTL to the samples table.
Samples are correctly expired after the TTL hits 0, but even if OpenNMS has recently collected and inserted new samples, the search index will gradually expire as well. Once the TTL of resource_attributes reaches 0, OpenNMS will still continue inserting new samples, but the REST API and the Web UI will no longer be aware that they exist. While samples are continuously inserted at every collection interval, and as such will always contain data with a fresh TTL, entries in resource_attributes are only ever inserted once, and their TTL is never updated, even when rebuilding Newts' MetadataCache after a restart.
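One way to watch this happen in Cassandra is to check the remaining TTL on a search-index row; it only ever counts down, while freshly written samples rows keep getting the full TTL. A sketch only - it assumes the default "newts" keyspace and the stock Newts search schema, and the context/resource values are placeholders:

-- Remaining TTL (in seconds) on a search-index entry; this value is never refreshed
SELECT attribute, value, TTL(value)
  FROM newts.resource_attributes
 WHERE context = 'G' AND resource = '<resource-id>';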
No obvious workaround
Once the TTL of entries in resource_attributes expires, the only way to get OpenNMS to rebuild the search index seems to involve a restart of OpenNMS. Unfortunately, since the TTL of existing entries in resource_attributes isn't updated on a restart, only attributes that have already expired will be recreated. Because of this, immediately after restarting OpenNMS to fix the broken metrics, other attributes may continue to disappear as their TTLs expire, depending on when OpenNMS originally started collecting them (or when they expired and OpenNMS was last restarted).
There do seem to be a few exceptions to this TTL behaviour: attributes related to telemetryAdapters, sinkConsumerMetrics and sinkProducerMetrics apparently get recreated immediately after the TTL expires.
As far as I can tell, the TTL for resource_attributes currently can't be configured separately from the samples TTL, so there's no easy way to disable it or to set it far enough in the future that this is unlikely to ever happen:
https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/resources/META-INF/opennms/applicationContext-timeseries-newts.xml#L92
https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/java/org/opennms/netmgt/newts/support/NewtsUtils.java#L64
The only other alternative seems to be to manually truncate the resource_attributes table once the first attributes start disappearing, thereby forcing OpenNMS to recreate the entire search index with a fresh TTL (see the example below). This would set them all to the same TTL again, as opposed to having to restart OpenNMS every time a new attribute becomes unavailable.
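For example, from cqlsh (assuming the default "newts" keyspace; adjust if org.opennms.newts.config.keyspace points elsewhere):

-- Drops only the search-index rows, not the collected samples; per the description
-- above, OpenNMS then recreates the index entries with a fresh TTL
TRUNCATE newts.resource_attributes;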
How to reproduce
After installing / compiling OpenNMS from scratch, it's possible to reproduce this issue reliably by tweaking the default configs a bit.
Changes to intervals in collectd and pollerd, to make the issue more visible:
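The original snippet isn't reproduced here; an example of the kind of change meant, with assumed values (shown for collectd-configuration.xml - the poller packages in pollerd-configuration.xml take the same interval attribute):

<!-- Collect SNMP data every 60 s so new samples are written once a minute -->
<service name="SNMP" interval="60000" user-defined="false" status="on">
  <parameter key="collection" value="default"/>
</service>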
Changes to opennms.properties:
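Again as an assumed example - the 900-second TTL is chosen to match the roughly 15-minute window described below:

# Use Newts for time series storage and give samples (and, because of this bug,
# the search index) a very short TTL
org.opennms.timeseries.strategy=newts
org.opennms.newts.config.ttl=900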
Manually add a node (or run discovery on 127.0.0.1) to start collecting data
Graphs should now be available and continue updating every minute for roughly 15 minutes on http://127.0.0.1:8980/opennms/graph/results.htm?resourceId=node[1].nodeSnmp[]&reports=all
After 15 minutes, all graphs will spontaneously disappear and OpenNMS will report "There is no data for this resource" until the next restart.
After a restart, the graphs should be back again, and OpenNMS will have continued to insert new samples - so other than the time it takes to restart OpenNMS, no data should have been lost, regardless of how long OpenNMS reports "There is no data for this resource".