  1. OpenNMS
  2. NMS-13029

Actively collected metrics suddenly become unavailable through API and Web UI due to static TTL on Newts search index


    Details

    • Sprint:
      Horizon 2020 - Nov 24 - Dec 9, Horizon 2020 - Dec 9 - Dec 23, Horizon 2020 - Dec 23 - Jan 6, Horizon 2020 - Jan 6 - Jan 20, Horizon 2021 - Jan 20 - Feb 3, Horizon 2021 - Feb 3 - Feb 17, Horizon 2021 - Feb 17 - Mar 3

      Description

      Symptom

      Attributes in the Newts search index start disappearing once OpenNMS has been running for longer than the configured TTL (org.opennms.newts.config.ttl).

      Because of this, graphs stop working at seemingly random times (depending on when a node was added and successfully scanned for the first time), and OpenNMS starts returning 404 errors through the API or "There is no data for this resource" in the Web UI.

      Cause

      When running OpenNMS with Newts as the timeseries strategy, metrics that are being actively collected will eventually vanish from the REST API and the Web UI, due to newts.resource_attributes having an identical TTL to the newts.samples table.

      Samples are correctly expired after the TTL hits 0, but even if OpenNMS has recently collected and inserted new samples, the search index will gradually expire as well. Once the TTL of resource_attributes reaches 0, OpenNMS will still continue inserting new samples, but the REST API and the Web UI will no longer be aware that they exist.

      While the samples are continuously inserted at every collection interval, and as such will always contain data with a fresh TTL, resource_attributes are only ever inserted once, and the TTL is never updated, even when rebuilding Newts' MetadataCache after a restart.
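
      The mechanism can be illustrated with a toy model. The sketch below is not OpenNMS or Newts code; it only mimics TTL expiry under the behaviour described above, where a sample row is written at every collection interval and the index row is written once. All function and parameter names are made up for illustration, and times are plain seconds.

```python
def latest_sample_alive(now, first_write, interval, ttl):
    """True if at least one sample row is still unexpired at `now`."""
    if now < first_write:
        return False
    # Samples are re-inserted every `interval` seconds, each with a fresh TTL.
    last_write = first_write + ((now - first_write) // interval) * interval
    return now - last_write < ttl


def index_entry_alive(now, first_write, ttl):
    """True if the search-index row, written only once, is still unexpired."""
    return first_write <= now < first_write + ttl


def metric_visible(now, first_write, interval, ttl):
    """In this model a metric is only discoverable while its index row exists."""
    return index_entry_alive(now, first_write, ttl) and latest_sample_alive(
        now, first_write, interval, ttl
    )


if __name__ == "__main__":
    TTL, INTERVAL = 900, 60  # illustrative: a 15 minute TTL, 1 minute collection
    for t in (0, 600, 899, 900, 3600):
        print(t, latest_sample_alive(t, 0, INTERVAL, TTL),
              metric_visible(t, 0, INTERVAL, TTL))
```

      In this model the samples stay alive indefinitely as long as collection continues, while visibility ends exactly one TTL after the first write, which matches the symptom described in this report.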

      No obvious workaround

      Once entries in resource_attributes expire, the only way to get OpenNMS to rebuild the search index seems to be restarting OpenNMS.

      Unfortunately, since the TTL of existing entries in resource_attributes isn't updated on a restart, only attributes that have already expired will be recreated. Because of this, immediately after restarting OpenNMS to fix the broken metrics, other attributes may continue to disappear due to the TTL expiring, depending on when OpenNMS originally started collecting them (or when they expired and OpenNMS was restarted last).

      There do seem to be a few exceptions to this TTL behaviour: attributes related to telemetryAdapters, sinkConsumerMetrics and sinkProducerMetrics apparently get recreated immediately after the TTL expires.

      As far as I can tell, the TTL for resource_attributes currently can't be configured separately from the samples TTL, so there's no easy way to disable it or set it to a TTL far enough in the future that this is unlikely to ever happen:

      https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/resources/META-INF/opennms/applicationContext-timeseries-newts.xml#L92

      https://github.com/OpenNMS/opennms/blob/686246224b53698aab1c39f9bf1ef648cd563f4a/features/newts/src/main/java/org/opennms/netmgt/newts/support/NewtsUtils.java#L64

      The only other alternative seems to be to manually truncate the resource_attributes table once the first attributes start disappearing, thereby forcing OpenNMS to recreate the entire search index with a fresh TTL. This would set them all to the same TTL again, as opposed to having to restart OpenNMS every time a new attribute becomes unavailable.
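
      To make the difference between the two options concrete, here is a small sketch that models the index as a mapping from attribute name to absolute expiry time. It is a simplification with hypothetical names, not OpenNMS code, and it assumes (as suggested above) that a restart re-inserts only entries that have already expired, while a truncate leads to every entry being re-inserted.

```python
def restart(index, attributes, now, ttl):
    """Model a restart: only missing or already expired entries are re-inserted."""
    rebuilt = dict(index)  # leave the caller's mapping untouched
    for attr in attributes:
        expiry = rebuilt.get(attr)
        if expiry is None or expiry <= now:
            rebuilt[attr] = now + ttl
    return rebuilt


def truncate_and_rebuild(attributes, now, ttl):
    """Model a truncate: every entry is re-inserted with the same fresh TTL."""
    return {attr: now + ttl for attr in attributes}


if __name__ == "__main__":
    ttl = 900
    # "a" was first collected at t=0, "b" at t=300 (illustrative values).
    index = {"a": 0 + ttl, "b": 300 + ttl}
    now = 950  # "a" has expired, "b" has not
    print(restart(index, ["a", "b"], now, ttl))
    print(truncate_and_rebuild(["a", "b"], now, ttl))
```

      After the modelled restart, "b" still carries its old expiry and will vanish on its own schedule, so another restart would be needed. After the modelled truncate, both entries expire at the same time.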

      How to reproduce

      After installing / compiling OpenNMS from scratch, it's possible to reproduce this issue reliably by tweaking the default configs a bit.

      1. Changes to intervals in collectd and pollerd, to make the issue more visible:
        sed -i 's/interval="300000"/interval="60000"/' poller-configuration.xml collectd-configuration.xml
        
      2. Changes to opennms.properties:
        # $ grep -hv ^# opennms.properties.d/*.properties
        org.opennms.timeseries.strategy=newts
        org.opennms.newts.config.ttl=900
        org.opennms.newts.query.minimum_step=60000
        org.opennms.newts.query.heartbeat=90000
        org.opennms.web.defaultGraphPeriod=last_1_hour
        
      3. Manually add a node (or run discovery on 127.0.0.1) to start collecting data
      4. Graphs should now be available and continue updating every minute for roughly 15 minutes on http://127.0.0.1:8980/opennms/graph/results.htm?resourceId=node[1].nodeSnmp[]&reports=all
      5. After 15 minutes, all graphs will spontaneously disappear and OpenNMS will report "There is no data for this resource" until the next restart.

      After a restart, the graphs should be back again. OpenNMS will have continued to insert new samples in the meantime, so apart from the time it takes to restart OpenNMS, no data should have been lost, regardless of how long OpenNMS reported "There is no data for this resource".

            People

            Assignee:
            fooker Dustin Frisch
            Reporter:
            brynjar Brynjar Eide