Minion: A programmatic means of obtaining health (alternate to 'opennms:health-check')

Description

A request to implement an alternative way of obtaining Minion health information, other than running the Karaf shell command 'opennms:health-check'.

The intent is to use this facility programmatically. Ideally, this would be implemented as a ReST API, where a GET request would return the result as JSON.
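As a rough illustration of the kind of programmatic consumption intended here, the sketch below parses a JSON health result and reduces it to a single verdict. The payload shape (a `healthy` flag plus a `responses` list with `description` and `status` fields) and the function name `summarize_health` are assumptions made for illustration, not the confirmed format of any OpenNMS endpoint.

```python
import json


def summarize_health(body: str) -> dict:
    """Reduce a JSON health result to an overall verdict and a list of failures.

    Assumed (hypothetical) payload shape:
    {"healthy": bool, "responses": [{"description": str, "status": str}, ...]}
    """
    payload = json.loads(body)
    responses = payload.get("responses", [])
    # Any check whose status is not "Success" is reported as a failure.
    failures = [r.get("description", "<unnamed>") for r in responses
                if r.get("status") != "Success"]
    # Use the explicit flag when present; otherwise derive it from the checks.
    healthy = payload.get("healthy", not failures)
    # Healthy only if the flag and every individual check agree.
    return {"healthy": bool(healthy) and not failures, "failures": failures}


# Made-up sample data; the check descriptions are illustrative only.
sample = json.dumps({
    "healthy": False,
    "responses": [
        {"description": "Verifying installed bundles", "status": "Success"},
        {"description": "Connecting to broker", "status": "Failure"},
    ],
})
print(summarize_health(sample))
```

A caller would obtain `body` from the GET request and could act on `failures` without parsing shell output.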

Acceptance / Success Criteria

None

Linked issues

depends on

NMS-13311

Support Rest API on Minion & Enable health-check REST feature

related to

NMS-13312

Health-check: provide restful api to query health for different tags

NMS-12685

Improve health-check to be more aligned with Kubernetes Probes

Activity

Jane Hou June 1, 2021 at 12:54 PM

Closing this Jira at this point because https://issues.opennms.org/browse/NMS-13311 is done, so the RESTful API is now provided. There were a few discussions last week, and the plan keeps changing. It has turned into an epic about the details of how to do a health check. All the plans will be updated on https://confluence.internal.opennms.com/pages/viewpage.action?spaceKey=NMS&title=Health+Check+on+Minion

Jane Hou May 27, 2021 at 3:02 PM
Edited

We had a meeting today to discuss more details and plans on this topic (attendees: Veena, Chandra, Ben, Chris, Stephan, Pierre, Chantal, Jane💃). We decided that, at this point, Minion will provide a RESTful API for the health check of all components (the same as what we get over SSH). Meanwhile, besides the health check for bundles covered in the last comment, we will also provide a RESTful API for the health check of individual components (https://confluence.internal.opennms.com/pages/viewpage.action?spaceKey=NMS&title=Health+Check+on+Minion). The Cloud team may create a Confluence page describing the problems they have with SSH and health-check and the solutions we currently have, and follow up with more feedback; we can then decide what more we can do to help them.

Jane Hou May 19, 2021 at 3:07 PM
Edited

Zoe clarified this today in the sprint meeting. The request consists of two parts. One is to provide a REST API for Minion. The other is to provide a health check that is more lightweight than our existing one. It only indicates whether the service is alive, and does not include the health of the connections between the service and others. This is similar to health/readiness in Kubernetes.

After discussing with Chandra today, we are going to do the following:

Enable the Karaf feature "health-check-rest" in Minion so that we can use a REST call to do a health check, as OpenNMS does today. https://issues.opennms.org/browse/NMS-13311
Add a RESTful API to the "health-check-rest" feature so that we can get the status of only the bundles of the Minion itself, without other components. https://issues.opennms.org/browse/NMS-13312
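
From a caller's side, the two items above could be pictured roughly as follows: one verdict computed from all checks, as the existing health-check gives, and a narrower one computed only from the Minion's own bundles. The `scope` tag and its values `"local"` and `"remote"` are hypothetical, chosen only to illustrate the split; nothing here defines how checks are actually labeled.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    name: str
    scope: str    # hypothetical tag: "local" (own bundles) or "remote" (connections)
    passed: bool


def full_health(results: Iterable[CheckResult]) -> bool:
    """Every check must pass, including connections to other components."""
    return all(r.passed for r in results)


def local_health(results: Iterable[CheckResult]) -> bool:
    """Only checks on the Minion itself count; remote failures are ignored."""
    return all(r.passed for r in results if r.scope == "local")


# Made-up example: bundles are fine, but a remote connection is down.
results = [
    CheckResult("bundles started", "local", True),
    CheckResult("broker reachable", "remote", False),
]
print(full_health(results), local_health(results))
```

In this example the full check fails while the bundle-only check passes, which is the distinction the lighter-weight check is meant to expose.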

Jane Hou May 13, 2021 at 1:07 AM
Edited

Former user Just some food for thought: the original use case for the health-check was to give a user a way to verify that the provided configuration works, and it was intended as a human interface. With the possibility of running Minions or other components in containers under orchestration, e.g. k8s, health checks are required to build resilient services. We simply reused our health-check command here. Because it was built more like an expensive diagnostic test run manually by a user, it hits resources heavily when run automatically at a higher frequency and across a large number of Minions.
In our Cloud Portal we use the health-check for two purposes: a) indicating to the user whether a Minion is working or not, which is more of a monitoring use case, and b) an orchestration use case, letting the portal automatically decide whether it makes sense to kill the Minion process and restart it so it recovers on its own, or whether we have a misconfiguration that can't be solved with restarts alone and requires immediate human intervention to get it running again.
In the k8s world, this is described as liveness and readiness probes, documented in NMS-10553, which by design run at a sub-minute interval driven by the orchestration software and are evaluated programmatically.

Yeah, I agree: in Kubernetes, it would be better to have both "health/liveness" and "health/readiness". But as far as I can tell, a lot of these are at the system level: for example, after the environment is prepared, the bean definitions are loaded, etc., the liveness changes to "correct" or "broken"; then, once the application is called or the command line is running, the readiness switches to "accepting-traffic". As for what was mentioned, determining whether there is a misconfiguration that can't be resolved by just restarting: is that what our health check currently does?

By the way, since I'm relatively new to OpenNMS and just want to learn: do we have OpenNMS/Minion deployed to Kubernetes already? If yes, is it in a lab environment or already in production? If not, what environment is our Minion currently deployed to? Thanks.

Jane Hou May 13, 2021 at 12:36 AM

"Additional information here - scale considerations should be kept in mind.
This new facility must provide health-check information in a way that will not overload the OpenNMS instance (since it will be called regularly)."

For this, I don't have much of an idea off the top of my head of how to make it obviously lightweight. Normally, HTTPS can be slower than SSH, but it depends, and not always. A REST service can normally handle hundreds of requests per second, so 1 request per 5 seconds should be fine. As far as I can tell, a lot of services use a RESTful API for health checks in Kubernetes, so I guess RESTful should be fine for us too?
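
A back-of-the-envelope check of the estimate above: if each Minion is polled once every 5 seconds, the aggregate rate seen by whatever performs or receives the polling grows linearly with the number of Minions. The fleet sizes below are made-up examples, and "hundreds of requests per second" is the rough capacity figure from this comment, not a measured value.

```python
def aggregate_rate(minions: int, interval_seconds: float) -> float:
    """Total health-check requests per second across all Minions."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return minions / interval_seconds


# Hypothetical fleet sizes, each Minion polled once every 5 seconds.
for fleet in (1, 100, 1000):
    print(fleet, "Minions ->", aggregate_rate(fleet, 5), "requests/second")
```

Under these assumptions a single Minion sees 0.2 requests per second, while a fleet of 1000 adds up to 200 requests per second at any shared component, which is where the scale consideration would start to matter.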

Fixed

Details
Assignee: Jane Hou
Reporter: Pierre Bouffard
HB Backlog Status: Backlog CM
Components: (none)
Fix versions: 28.0.0
Priority: Major

Created May 4, 2021 at 2:16 PM

Updated June 18, 2021 at 7:56 PM

Resolved June 1, 2021 at 12:54 PM
