Race condition when enabling the Situations Feedback feature
Description
Activity
Stefan Wachter May 4, 2021 at 11:22 AM (edited)
This issue does not seem to be caused by a race condition. (Maybe it appeared as a race condition because of https://opennms.atlassian.net/browse/NMS-12766#icft=NMS-12766.)
The problem is that a "-" in cfg filenames, as in org.opennms.features.situation-feedback.persistence.elastic.cfg, has a special meaning to the Apache Felix file installer: it separates the PID of a managed service factory from a subname (cf. file install configurations).
To fix the problem, a filename without a minus sign must be used for the configuration, and the persistent-id of the corresponding cm:property-placeholder must be set accordingly. I propose to simply drop the minus sign from the filename.
In addition, the documentation has to be adapted.
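To illustrate the parsing described above, here is a minimal sketch of how such a filename would be interpreted. It assumes the split happens at the first "-" in the name (after stripping the .cfg extension); the function name split_cfg_name and the dash-free filename in the second call are hypothetical and only mimic the behavior, they are not part of Felix or OpenNMS.

```shell
#!/bin/sh
# Sketch only: mimics how a cfg filename might be split into a factory PID
# and a subname. Assumption: the split is made at the FIRST "-" in the name.
split_cfg_name() {
  name="${1##*/}"      # drop the directory part
  name="${name%.cfg}"  # drop the .cfg extension
  case "$name" in
    *-*)
      # Everything before the first dash is treated as the factory PID,
      # everything after it as the subname.
      printf 'factory_pid=%s subname=%s\n' "${name%%-*}" "${name#*-}"
      ;;
    *)
      # No dash: the whole name is a plain PID.
      printf 'pid=%s\n' "$name"
      ;;
  esac
}

split_cfg_name /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
# prints: factory_pid=org.opennms.features.situation subname=feedback.persistence.elastic

# Hypothetical dash-free filename, as proposed above:
split_cfg_name /opt/opennms/etc/org.opennms.features.situationfeedback.persistence.elastic.cfg
# prints: pid=org.opennms.features.situationfeedback.persistence.elastic
```

Under that assumption, the current filename would never be bound to the PID the bundle expects, which would explain why the configuration only takes effect after it is set again from the Karaf shell.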
Alejandro Galue June 23, 2020 at 3:01 PM
Unfortunately, I don't know the details of how Karaf behaves internally, as it is still a black box to me.
All those commands were shared with me. I never invested time understanding how they actually work.
Matthew Brooks June 22, 2020 at 10:36 PM
I'm wondering if the config is actually loaded but not used properly due to a race condition, and resetting the config is just fixing the race condition by reloading the bundle.
I'm not sure if config:list '(service.pid=org.opennms.features.situation-feedback.persistence.elastic)' will show config loaded from the cfg files, so maybe that's why it is showing empty. Do you happen to know if that's how it works?
Alejandro Galue June 22, 2020 at 9:39 PM
More details:
[vagrant@cerniossfabls24 ~]$ cat /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
elasticUrl=https://elastic:9200
globalElasticUser=elastic
globalElasticPassword=0p3nNM5
indexPrefix=dc1-
elasticIndexStrategy=daily
connTimeout=30000
readTimeout=300000
[vagrant@cerniossfabls24 ~]$ ls -ld /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
-rw-r--r--. 1 root root 172 Jun 22 17:30 /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
[vagrant@cerniossfabls24 ~]$ /opt/opennms/bin/fix-situation-feedback.sh
Password authentication
Password:
----------------------------------------------------------------
Pid: org.opennms.features.situation-feedback.persistence.elastic
BundleLocation: mvn:org.opennms.features.situation-feedback/org.opennms.features.situation-feedback.elastic/26.1.1
Properties:
connTimeout = 30000
elasticIndexStrategy = daily
elasticUrl = https://elastic:9200
felix.fileinstall.filename = file:/opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
globalElasticPassword = 0p3nNM5
globalElasticUser = elastic
indexPrefix = dc1-
readTimeout = 300000
service.pid = org.opennms.features.situation-feedback.persistence.elastic
[vagrant@cerniossfabls24 ~]$ ls -ld /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
-rw-r--r--. 1 root root 186 Jun 22 17:37 /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
[vagrant@cerniossfabls24 ~]$ cat /opt/opennms/etc/org.opennms.features.situation-feedback.persistence.elastic.cfg
elasticUrl = https://elastic:9200
globalElasticUser = elastic
globalElasticPassword = 0p3nNM5
indexPrefix = dc1-
elasticIndexStrategy = daily
connTimeout = 30000
readTimeout = 300000
The exact same file was rewritten with the exact same content (unless the spaces around the equal sign are required, which is not the case for the other Elasticsearch plugins).
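One way to double-check that the two versions of the file really differ only in the whitespace around the equal sign is to normalize both before comparing them. The following is a minimal sketch; the function name normalize_cfg and the before.cfg / after.cfg filenames are hypothetical.

```shell
#!/bin/sh
# Sketch only: strip whitespace around the first "=" of each line and sort
# the result, so that "key = value" and "key=value" compare as equal.
# Reads the files given as arguments, or stdin when none are given.
normalize_cfg() {
  sed -E 's/^[[:space:]]*([^=[:space:]]+)[[:space:]]*=[[:space:]]*/\1=/' "$@" | sort
}

# Usage (hypothetical copies taken before and after running the fix script):
#   normalize_cfg before.cfg > before.norm
#   normalize_cfg after.cfg  > after.norm
#   diff before.norm after.norm && echo "same settings"
```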
OpenNMS has multiple optional features installed via Karaf that require configuration to work.
Because I always use automation tools to configure solutions for customers and to build test environments, an unattended script is usually responsible for configuring everything.
All the features I tried in the past can be configured this way, and I never had issues with them.
Unfortunately, that is not the case with Situations Feedback. After a clean installation (i.e., nothing inside /opt/opennms/data), the feature tries to start but it doesn't seem to work, and I can see the following on the Karaf shell:
admin@opennms> health-check
Verifying the health of the container
Verifying installed bundles [ Success ]
ALEC :: Driver [ Success ] => Tick duration (99 percentile): 22 ms
Connecting to ElasticSearch ReST API (Flows) [ Success ] => Not configured
Number of active alarms stored in Elasticsearch (Alarm History) [ Success ] => Found 3 alarms.
Connecting to ElasticSearch ReST API (Situation Feedback) [ Timeout ] => Health Check did not finish within 5000 ms
This Situations Feedback plugin is configured very similarly to the Elasticsearch forwarders for Events and Alarm History. These two always work when setting them automatically (never had issues with them or any other feature), but I haven't found a way to start the Situations Feedback in a similar way that always works.
The following is the easiest fix I found that always works for me:
#!/bin/bash
# Re-apply the Situations Feedback configuration through the Karaf shell.
ssh -p 8101 admin@localhost "\
config:edit org.opennms.features.situation-feedback.persistence.elastic;
config:property-set elasticUrl \"https://elastic:9200\";
config:property-set globalElasticUser \"elastic\";
config:property-set globalElasticPassword \"0p3nNM5\";
config:property-set indexPrefix \"dc1-\";
config:property-set elasticIndexStrategy \"daily\";
config:property-set connTimeout 30000;
config:property-set readTimeout 300000;
config:update;
config:list '(service.pid=org.opennms.features.situation-feedback.persistence.elastic)'
"
The above script reconfigures the feature. Internally, that triggers the reload of all the dependent bundles, and after that, the plugin works as intended. Most of the time, after doing this, the feature survives an OpenNMS restart. But, if the user removes the content of the data directory prior to starting OpenNMS (like on an upgrade), the above script must be executed one more time to fix the problem.
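Since the workaround has to be repeated whenever the data directory is wiped, it could be wrapped in a check so that it only runs when needed. The following is a minimal sketch that inspects health-check output for the Situation Feedback line; the function name situation_feedback_unhealthy is hypothetical, and the usage in the comments assumes the fix script lives at /opt/opennms/bin/fix-situation-feedback.sh as in the session shown in the comments.

```shell
#!/bin/sh
# Sketch only: reads Karaf health-check output on stdin and succeeds (exit 0)
# when the Situation Feedback check is missing or not marked "[ Success ]".
situation_feedback_unhealthy() {
  # No matching line at all is treated as unhealthy.
  sf_line="$(grep -F '(Situation Feedback)')" || return 0
  case "$sf_line" in
    *'[ Success ]'*) return 1 ;;  # healthy, nothing to do
    *) return 0 ;;                # timeout or failure
  esac
}

# Usage (assumes the same SSH access to the Karaf shell as the script above):
#   if ssh -p 8101 admin@localhost health-check | situation_feedback_unhealthy; then
#     /opt/opennms/bin/fix-situation-feedback.sh
#   fi
```

This only automates the workaround; it does not address the underlying cause.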