Race condition when enabling the Situations Feedback feature

Description

OpenNMS has multiple optional features installed via Karaf that require configuration to work.

Because I always use automation tools to configure customer solutions and to build test environments, an unattended script is usually responsible for configuring everything.

All the features I have tried in the past can be configured this way, and I never had issues with them.

Unfortunately, that is not the case with Situations Feedback. After a clean installation (i.e., nothing inside /opt/opennms/data), the feature tries to start but it doesn't seem to work, and I can see the following on the Karaf shell:

admin@opennms> health-check
Verifying the health of the container

Verifying installed bundles                                      [ Success ]
ALEC :: Driver                                                   [ Success ] => Tick duration (99 percentile): 22 ms
Connecting to ElasticSearch ReST API (Flows)                     [ Success ] => Not configured
Number of active alarms stored in Elasticsearch (Alarm History)  [ Success ] => Found 3 alarms.
Connecting to ElasticSearch ReST API (Situation Feedback)        [ Timeout ] => Health Check did not finish within 5000 ms

This Situations Feedback plugin is configured very similarly to the Elasticsearch forwarders for Events and Alarm History. These two always work when setting them automatically (never had issues with them or any other feature), but I haven't found a way to start the Situations Feedback in a similar way that always works.

The following is the easiest fix I found that always works for me:

#!/bin/bash
# Re-apply the Situations Feedback configuration through the Karaf shell.
# Values are left unquoted inside the outer double quotes, which would otherwise be closed early.
ssh -p 8101 admin@localhost "\
  config:edit org.opennms.features.situation-feedback.persistence.elastic;
  config:property-set elasticUrl https://elastic:9200;
  config:property-set globalElasticUser elastic;
  config:property-set globalElasticPassword 0p3nNM5;
  config:property-set indexPrefix dc1-;
  config:property-set elasticIndexStrategy daily;
  config:property-set connTimeout 30000;
  config:property-set readTimeout 300000;
  config:update;
  config:list '(service.pid=org.opennms.features.situation-feedback.persistence.elastic)'
"

The above script reconfigures the feature. Internally, that triggers the reload of all the dependent bundles, and after that, the plugin works as intended. Most of the time, after doing this, the feature survives an OpenNMS restart. But, if the user removes the content of the data directory prior to starting OpenNMS (like on an upgrade), the above script must be executed one more time to fix the problem.
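Since the workaround has to be repeated whenever the data directory is wiped, an unattended script could decide on its own when to re-run it. The following is only a sketch: the function name situation_feedback_healthy is hypothetical, and it assumes that health-check prints the check description and its status on the same line, as in the output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: reads Karaf health-check output on stdin and succeeds
# only if the "Situation Feedback" check line reports "[ Success ]".
# Assumption: description and status appear on the same output line.
situation_feedback_healthy() {
  grep -F 'Situation Feedback' | grep -q -F '[ Success ]'
}

# Possible usage, re-running the workaround script only when needed:
#   ssh -p 8101 admin@localhost health-check | situation_feedback_healthy \
#     || /opt/opennms/bin/fix-situation-feedback.sh
```

If the Situation Feedback line is missing altogether, the helper also reports failure, so the workaround would run in that case too.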

Acceptance / Success Criteria

None


Activity


Stefan Wachter May 4, 2021 at 1:47 PM

Stefan Wachter May 4, 2021 at 11:22 AM
Edited

This issue does not seem to be caused by a race condition. (It may have looked like one because of NMS-12766: https://opennms.atlassian.net/browse/NMS-12766.)

The problem is that a "-" in a cfg filename such as org.opennms.features.situation-feedback.persistence.elastic.cfg has a special meaning to the Apache Felix file installer: it separates the PID of a managed service factory from a subname (cf. file install configurations).

To fix the problem, a filename without a minus sign must be used for the configuration, and the persistent-id of the corresponding cm:property-placeholder must be set accordingly. I propose to simply drop the minus sign from the filename.

In addition, the documentation has to be updated to match.
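The filename interpretation described in this comment can be illustrated with a small sketch. The function name split_cfg_name is hypothetical, and the sketch assumes the split happens at the first "-" after the .cfg extension is stripped; it only mimics the described behavior and is not the file installer's actual code.

```shell
#!/bin/sh
# Hypothetical helper: show how a .cfg filename would be read if the part
# before the first "-" is a managed service factory PID and the rest a subname.
# Assumption: the split is made at the first "-" in the name.
split_cfg_name() (
  name="$(basename "$1" .cfg)"
  case "$name" in
    *-*) printf 'factoryPid=%s subname=%s\n' "${name%%-*}" "${name#*-}" ;;
    *)   printf 'pid=%s\n' "$name" ;;
  esac
)

# Example:
#   split_cfg_name /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
```

Under that assumption, the file from this issue would be treated as a factory configuration for org.opennms.features.situation rather than as the configuration of the PID the feature expects.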

Alejandro Galue June 23, 2020 at 3:01 PM

Unfortunately, I don't know the details of how Karaf behaves internally, as it is still a black box to me.

All those commands were shared with me. I never invested time understanding how they actually work.

Matthew Brooks June 22, 2020 at 10:36 PM

I'm wondering if the config is actually loaded but not used properly due to a race condition, and resetting the config just works around the race condition by reloading the bundle.

I'm not sure if config:list '(service.pid=org.opennms.features.situation-feedback.persistence.elastic)' will show config loaded from the cfg files, so maybe that's why it is showing empty. Do you happen to know if that's how it works?

Alejandro Galue June 22, 2020 at 9:39 PM

More details:

[vagrant@cerniossfabls24 ~]$ cat /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
elasticUrl=https://elastic:9200
globalElasticUser=elastic
globalElasticPassword=0p3nNM5
indexPrefix=dc1-
elasticIndexStrategy=daily
connTimeout=30000
readTimeout=300000
[vagrant@cerniossfabls24 ~]$ ls -ld /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
-rw-r--r--. 1 root root 172 Jun 22 17:30 /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
[vagrant@cerniossfabls24 ~]$ /opt/opennms/bin/fix-situation-feedback.sh
Password authentication
Password:
----------------------------------------------------------------
Pid:            org.opennms.features.situation-feedback.persistence.elastic
BundleLocation: mvn:org.opennms.features.situation-feedback/org.opennms.features.situation-feedback.elastic/26.1.1
Properties:
   connTimeout = 30000
   elasticIndexStrategy = daily
   elasticUrl = https://elastic:9200
   felix.fileinstall.filename = file:/opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
   globalElasticPassword = 0p3nNM5
   globalElasticUser = elastic
   indexPrefix = dc1-
   readTimeout = 300000
   service.pid = org.opennms.features.situation-feedback.persistence.elastic
[vagrant@cerniossfabls24 ~]$ ls -ld /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
-rw-r--r--. 1 root root 186 Jun 22 17:37 /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
[vagrant@cerniossfabls24 ~]$ cat /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
elasticUrl = https://elastic:9200
globalElasticUser = elastic
globalElasticPassword = 0p3nNM5
indexPrefix = dc1-
elasticIndexStrategy = daily
connTimeout = 30000
readTimeout = 300000

The exact same file was rewritten with the exact same content (unless the spaces around the equal sign are required, which is not the case for the other Elasticsearch plugins).
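The claim that both versions of the file carry the same settings can be checked mechanically by normalizing the spacing around the equal sign before comparing. This is a rough sketch; the function name normalize_cfg is hypothetical, and it assumes simple key=value lines whose keys contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper: strip whitespace around the first "=" on each line and
# sort the result, so two .cfg files can be compared by content alone.
# Assumption: plain "key=value" or "key = value" lines, one per line.
normalize_cfg() {
  sed -E 's/^[[:space:]]*([^=[:space:]]+)[[:space:]]*=[[:space:]]*/\1=/' | sort
}

# Possible usage with two saved copies of the file (paths are placeholders):
#   normalize_cfg < before.cfg > before.norm
#   normalize_cfg < after.cfg  > after.norm
#   diff before.norm after.norm && echo "same settings"
```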

Fixed

Created June 22, 2020 at 9:12 PM
Updated May 12, 2021 at 12:40 PM
Resolved May 6, 2021 at 4:00 PM
