Failures when Jaeger tracing is enabled on Core server and Minion
Activity

DJ Gregor December 12, 2022 at 4:11 PM
Fix merged into release-31.x.
I had looked into a few different solutions, and went with option #1 which was the simplest (but there were still fifteen commits, heh):
1. Update Jaeger client libraries to work with new version of Apache Thrift (this is what I went with, and it required updating a few other libraries and dealing with some API changes). This is what was in https://github.com/OpenNMS/opennms/pull/5281
2. Switch to OpenTelemetry with the OpenTracing shim, so minimal changes would be needed in the places that use OpenTracing today. There is an experimental PR for this: https://github.com/OpenNMS/opennms/pull/4991
3. Go straight to OpenTelemetry and convert over existing code using the OpenTracing API to OpenTelemetry. No PR for this yet.
I'll open up a ticket to do #2 or #3 (#3 most likely) in develop for going-forward use.

DJ Gregor December 11, 2022 at 10:42 PM (edited)
There are two additional issues I found while working on this. Note that both of these have always been broken, AFAICT, and weren't introduced with the upgrade to the Apache Thrift library. I can reproduce both of them with the opennms/minion:30.0.4 Docker image.
1. The Docker Minion confd templates and related documentation for Jaeger configuration are wrong: they reference jaeger-agent-host as the system property for configuration, but the Jaeger client library expects JAEGER_AGENT_HOST. Our general documentation on tracing is correct, but the Configuring Minion via confd documentation and confd templates are not.
2. Sending Jaeger data over TCP using JAEGER_ENDPOINT doesn't work because of issues with okhttp3 (see exception below). In general, I prefer using JAEGER_ENDPOINT because sending UDP packets (which is what JAEGER_AGENT_HOST does) can be very problematic when they are sent to another host. They can be dropped because the packets are too large (which can confusingly result in some trace data arriving but not others), because certain container environments such as (Co)Lima lack UDP support, etc.
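To illustrate why the jaeger-agent-host property from the confd templates has no effect, here is a minimal Java sketch of an exact-name lookup. It assumes the Jaeger client resolves each setting by its exact name, checking system properties first and then environment variables; that lookup order is an assumption for illustration, not something confirmed in this ticket. The class name, the resolve helper, and the host name jaeger.example.org are all hypothetical.

```java
import java.util.Map;
import java.util.Properties;

public class JaegerPropertyLookup {

    // Hypothetical helper: resolve a setting by its exact name, preferring a
    // system property and falling back to an environment variable.
    // ASSUMPTION: this mirrors how the Jaeger client reads its configuration.
    public static String resolve(String name, Properties sysProps, Map<String, String> env) {
        return sysProps.getProperty(name, env.get(name));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        Map<String, String> env = Map.of();

        // What the confd templates set (hypothetical host name).
        props.setProperty("jaeger-agent-host", "jaeger.example.org");
        System.out.println("with jaeger-agent-host only: "
                + resolve("JAEGER_AGENT_HOST", props, env));

        // What the client library expects.
        props.setProperty("JAEGER_AGENT_HOST", "jaeger.example.org");
        System.out.println("with JAEGER_AGENT_HOST set: "
                + resolve("JAEGER_AGENT_HOST", props, env));
    }
}
```

Under that assumption, a value stored under jaeger-agent-host is never seen by a lookup for JAEGER_AGENT_HOST, so the client would silently fall back to its defaults.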
For #2 above, I tested with this configuration:
Here’s the exception:

DJ Gregor September 14, 2022 at 12:17 AM (edited)
I just had this pop up in a Minion in a new way: when I enabled tracing, it stopped communicating with OpenNMS (I had it configured for gRPC).

DJ Gregor August 3, 2022 at 12:24 AM
I just ran into this on a clean develop system with Jaeger tracing enabled:
Apache Thrift was upgraded from 0.12.0 to 0.14.0 in https://github.com/OpenNMS/opennms/pull/4939
Unfortunately, the Jaeger client library we currently use, 0.34.0, is incompatible with the newer version of Thrift. Its POM calls for 0.12.0: https://repo1.maven.org/maven2/io/jaegertracing/jaeger-thrift/0.34.0/jaeger-thrift-0.34.0.pom
It looks like version 1.6.0 of the jaeger-client fixes this problem: https://github.com/jaegertracing/jaeger-client-java/pull/774
Thrift 0.12.0 still appears to be in release-30.x, so this problem should not exist in that branch, only in develop.
We should probably have a test in CI that tries to start up OpenNMS with tracing enabled, as that would have caught this. Any suggestions on how to go about that?
I'm happy to take a shot at trying a new Jaeger client library and adding a CI check.
When Jaeger tracing is enabled on the core system, the following exception is thrown at startup:
On the Minion, the error first appears when a flow is received, and the exception is as follows: