Actively collected metrics suddenly become unavailable through API and Web UI due to static TTL on Newts search index

Description

Symptom

Attributes in the Newts search index start disappearing after OpenNMS has been running for longer than the configured TTL (org.opennms.newts.config.ttl).

Due to this, at seemingly random intervals (depending on when a node was added and successfully scanned for the first time) graphs will stop working and OpenNMS will start returning 404 errors through the API or "There is no data for this resource" in the Web UI.

Cause

When running OpenNMS with Newts as the timeseries strategy, metrics that are being actively collected will eventually vanish from the REST API and the Web UI, due to entries in the resource_attributes table having an identical TTL to the samples table.

Samples are correctly expired once their TTL reaches 0, but the search index gradually expires as well, even while OpenNMS keeps collecting and inserting fresh samples. Once the TTL of the resource_attributes entries reaches 0, OpenNMS still continues inserting new samples, but the REST API and the Web UI are no longer aware that they exist.

While the samples are continuously inserted at every collection interval, and as such will always contain data with a fresh TTL, resource_attributes are only ever inserted once, and the TTL is never updated, even when rebuilding Newts' MetadataCache after a restart.
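The difference is easy to see in Cassandra by comparing the remaining TTLs of the two tables. A minimal cqlsh sketch, assuming the default newts keyspace and the stock schema (keyspace and column names are assumptions and may differ in customized installs):

    -- Remaining TTL (in seconds) on search index entries; these only ever count down:
    SELECT context, resource, attribute, TTL(value)
    FROM newts.resource_attributes
    LIMIT 10;

    -- Remaining TTL on samples; newly collected rows show up with a full TTL again:
    SELECT resource, metric_name, collected_at, TTL(value)
    FROM newts.samples
    LIMIT 10;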

No obvious workaround

Once the TTLs of entries in resource_attributes expire, the only way to get OpenNMS to rebuild the search index seems to be a restart of OpenNMS.

Unfortunately, since the TTL of existing entries in resource_attributes isn't refreshed on a restart, only attributes that have already expired will be recreated. Because of this, immediately after restarting OpenNMS to fix the broken metrics, other attributes may keep disappearing as their TTLs expire, depending on when OpenNMS originally started collecting them (or when they last expired and OpenNMS was last restarted).

There do seem to be a few exceptions to this TTL behaviour: attributes related to telemetryAdapters, sinkConsumerMetrics and sinkProducerMetrics apparently get recreated immediately after the TTL expires.

As far as I can tell, the TTL for resource_attributes currently can't be configured separately from the samples TTL, so there's no easy way to disable it or set it to a TTL far enough in the future that this is unlikely to ever happen:

https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/resources/META-INF/opennms/applicationContext-timeseries-newts.xml#L92

https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/java/org/opennms/netmgt/newts/support/NewtsUtils.java#L64
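For reference, the single shared TTL is set via one property in opennms.properties, and it applies to samples and search index entries alike (the value below is only an example, in seconds):

    # etc/opennms.properties
    # One TTL for everything written by the Newts strategy; there is currently no
    # separate setting for resource_attributes (example value, roughly one year)
    org.opennms.newts.config.ttl=31540000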

The only other alternative seems to be manually truncating the resource_attributes table once the first attributes start disappearing, thereby forcing OpenNMS to recreate the entire search index with a fresh TTL. This resets all entries to the same TTL again, instead of having to restart OpenNMS every time another attribute becomes unavailable.
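A sketch of that manual workaround in cqlsh, again assuming the default newts keyspace; it discards the entire search index (but not the samples) so that it gets recreated with a fresh TTL:

    -- Throw away all (partially expired) search index entries; samples are untouched.
    TRUNCATE newts.resource_attributes;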

How to reproduce

After installing / compiling OpenNMS from scratch, it's possible to reproduce this issue reliably by tweaking the default configs a bit.

  1. Changes to the intervals in collectd and pollerd, to make the issue more visible (see the configuration sketch after this list)

  2. Changes to opennms.properties (see the configuration sketch after this list)

  3. Manually add a node (or run discovery on 127.0.0.1) to start collecting data

  4. Graphs should now be available and continue updating every minute for roughly 15 minutes on http://127.0.0.1:8980/opennms/graph/results.htm?resourceId=node[1].nodeSnmp[]&reports=all

  5. After 15 minutes, all graphs will spontaneously disappear and OpenNMS will report "There is no data for this resource" until the next restart.
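Illustrative snippets for steps 1 and 2 above, assuming a stock install. The one-minute graph updates and the 15-minute expiry described here suggest an interval of 60000 ms and a TTL of 900 seconds; the exact values are assumptions:

    <!-- etc/collectd-configuration.xml (fragment): collect every 60 s instead of the
         default 5 min; the same interval change can be made in etc/poller-configuration.xml -->
    <service name="SNMP" interval="60000" user-defined="false" status="on">
      ...
    </service>

    # etc/opennms.properties: use Newts with a very short samples TTL (900 s = 15 min)
    org.opennms.timeseries.strategy=newts
    org.opennms.newts.config.ttl=900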

After a restart, the graphs should be back again, and OpenNMS will have continued to insert new samples in the meantime - so other than the time it takes to restart OpenNMS, no data should have been lost, regardless of how long OpenNMS reported "There is no data for this resource".
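To confirm that collection really kept running while the UI reported no data, the samples table can be checked directly (again assuming the default newts keyspace):

    -- Run this a few minutes apart: the row count keeps growing even while the
    -- Web UI and REST API report "There is no data for this resource".
    -- (Fine on a small test install; a full count can be slow on a large cluster.)
    SELECT count(*) FROM newts.samples;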

Acceptance / Success Criteria

None

Activity

Brynjar Eide February 18, 2021 at 3:32 PM

Brilliant! I bumped my cache entries' TTL to something like 10 years as a temporary workaround, but I look forward to testing your fix.
Thanks for fixing the bug so quickly, and for the heads up about the fixed version being available.

fooker February 17, 2021 at 3:27 PM

I managed to fix the caching and priming all in the newts project.
If you want to give this a try, you can exchange /opt/opennms/lib/newts-cassandra-search-1.5.2.jar with the updated version from repo (I can also provide a build if required).

Jesse White February 10, 2021 at 2:48 PM

In OpenNMS we use a different implementation of the cache, so we would need to consider updating that one as well:
https://github.com/OpenNMS/opennms/blob/opennms-27.0.5-1/features/newts/src/main/java/org/opennms/netmgt/newts/support/GuavaSearchableResourceMetadataCache.java

We also prime the cache when OpenNMS is restarted in order to avoid an influx of writes to the cluster:
https://github.com/OpenNMS/newts/blob/1.5.3/cassandra/search/src/main/java/org/opennms/newts/cassandra/search/CassandraCachePrimer.java

The priming would also need to be updated to consider the remaining TTL so that the re-insert would be performed when needed.

fooker February 10, 2021 at 8:56 AM

That is an awesome bug report and description. Thank you!

The fix above needs some additional testing.

fooker February 10, 2021 at 8:52 AM

Fixed

Details


Created November 30, 2020 at 7:49 PM
Updated March 2, 2021 at 8:37 AM
Resolved March 2, 2021 at 8:37 AM