RingBufferTimeseriesWriter.destroy can take a long time or hang due to BlockingServiceLookup.lookup in WorkProcessors
Description
Acceptance / Success Criteria
Activity
DJ Gregor February 3, 2023 at 3:37 AM (edited)
Found an issue where the force shutdown wasn’t working as expected. Once we have decided to force shutdown, LMAX's drainAndHalt will still wait until draining is complete. The PR in https://github.com/OpenNMS/opennms/pull/5767 has onEvent throw away metrics at this point.
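For illustration, here is a minimal sketch of that "discard on force shutdown" idea, assuming an LMAX WorkHandler-based writer. The class, field, and method names (DiscardingWriter, SampleEvent, writeSamples, shutdownRequested) are made up for this sketch and are not the actual code in the PR:

    import com.lmax.disruptor.WorkHandler;
    import com.lmax.disruptor.WorkerPool;

    // Sketch only: holds the samples to persist; a stand-in for the real event type.
    class SampleEvent {
    }

    // Sketch only: once a force shutdown has been decided, onEvent drops the
    // samples instead of blocking on a storage lookup, so draining can finish.
    class DiscardingWriter implements WorkHandler<SampleEvent> {
        private final WorkerPool<SampleEvent> workerPool;
        private volatile boolean shutdownRequested = false;

        DiscardingWriter(WorkerPool<SampleEvent> workerPool) {
            this.workerPool = workerPool;
        }

        @Override
        public void onEvent(SampleEvent event) throws Exception {
            if (shutdownRequested) {
                return; // throw the metrics away; do not look up TimeSeriesStorage
            }
            writeSamples(event); // normal path: look up storage and persist
        }

        private void writeSamples(SampleEvent event) {
            // persistence logic omitted in this sketch
        }

        public void destroy() {
            shutdownRequested = true;
            // drainAndHalt() still waits for the ring buffer to empty, but each
            // remaining event is now a cheap no-op instead of a blocking lookup.
            workerPool.drainAndHalt();
        }
    }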
I should note that I found this issue while working on https://opennms.atlassian.net/browse/NMS-15351. You can see the example where I found it in the description of https://github.com/OpenNMS/opennms/pull/5758. The "... snip ..." part where it was doing "Installing features: [opennms-plugins-cortex-tss]" (and behind the scenes, it was waiting to destroy the opennms-timeseries-api) was about 2-3 minutes long, which is what made me realize that the 30-second shutdown timeout wasn't working.
DJ Gregor January 23, 2023 at 12:15 AM
One thing I noticed: when we start to drain the worker pool, we don’t do anything to prevent new work from coming in. This can lead to the destroy hanging indefinitely. I added a stats logging thread when destroy is called, and this is what I see:
2023-01-22T19:09:16,533 | INFO | FelixFrameworkWiring | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | Draining and halting the time series worker pool. Entries in ring buffer: 0
2023-01-22T19:09:16,535 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:21,538 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:26,540 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:31,541 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:36,542 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:41,543 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 28
2023-01-22T19:09:46,544 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 31
2023-01-22T19:09:51,546 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:09:56,549 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:01,554 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:06,556 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:11,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:16,558 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 77
2023-01-22T19:10:21,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
2023-01-22T19:10:26,559 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
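The status logging above presumably came from a simple periodic task. As a reference point, a sketch of what such a destroy-status logger could look like; the thread name and 5-second period mirror the log output, while the counter parameter is a stand-in for the writer's real entry counter and the code is illustrative only:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of a destroy()-time status logger similar to the one that produced
    // the log lines above; uses System.out instead of the real logger.
    class DestroyStatusLogger {
        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor(
                        r -> new Thread(r, "RingBufferTimeseriesWriter-destroy-status"));

        void start(AtomicLong entriesLeftToDrain) {
            executor.scheduleAtFixedRate(
                    () -> System.out.println(
                            "destroy() in progress. Entries left in ring buffer to drain: "
                                    + entriesLeftToDrain.get()),
                    0, 5, TimeUnit.SECONDS);
        }

        void stop() {
            executor.shutdownNow();
        }
    }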
Also, we decrement the entry counter at the top of onEvent, not at the end, so the stats shown are a little misleading: they do not include the work in progress while we’re blocked in onEvent (and you think it should have shut down by now).
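A sketch of what moving the decrement to the end of onEvent could look like, so in-flight work still shows up in the drain count; the names are illustrative and not the actual writer code:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: decrement the in-flight counter only after the event has been
    // handled (or discarded), so destroy-status output reflects work in progress.
    class CountingHandler {
        private final AtomicLong entriesLeftToDrain = new AtomicLong();

        void onEvent(Object event) {
            try {
                handle(event); // may block for minutes in BlockingServiceLookup.lookup
            } finally {
                entriesLeftToDrain.decrementAndGet();
            }
        }

        private void handle(Object event) {
            // persistence logic omitted in this sketch
        }
    }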
Details
Assignee: DJ Gregor
Reporter: DJ Gregor
HB Grooming Date: Jan 23, 2023
HB Backlog Status: Refined Backlog
Sprint: None
Fix versions:
Priority: Medium

In certain cases, RingBufferTimeseriesWriter.destroy can hang for a while (5+ minutes, or indefinitely) due to a BlockingServiceLookup.lookup in the WorkProcessors. I think this is what causes it:

1. The opennms-timeseries-api feature is enabled.
2. The org.opennms.features.org.opennms.features.timeseries container in Karaf is being destroyed (this seems to happen when OpenNMS is in the process of refreshing the org.opennms.features.org.opennms.features.timeseries container).
3. There is data in the ring buffer and one or more WorkProcessors are trying to work on those samples.
4. No TimeSeriesStorage implementation is available.

For #2, I see this in karaf.log:
2023-01-21T16:12:14,062 | INFO | FelixFrameworkWiring | BlueprintExtender | 29 - org.apache.aries.blueprint.core - 1.10.3 | Destroying container for blueprint bundle org.opennms.features.org.opennms.features.timeseries/31.0.2
Looking at a stacktrace, I can see it waiting on com.lmax.disruptor.WorkerPool.drainAndHalt:

"FelixFrameworkWiring" #393 daemon prio=5 os_prio=0 cpu=987666.87ms elapsed=1007.34s tid=0x0000fffe38a47800 nid=0x229 runnable [0x0000fffe3cffa000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.yield(java.base@11.0.16/Native Method)
        at com.lmax.disruptor.WorkerPool.drainAndHalt(WorkerPool.java:148)
        at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.destroy(RingBufferTimeseriesWriter.java:162)
        at org.opennms.netmgt.timeseries.samplewrite.TimeseriesPersisterFactory.destroy(TimeseriesPersisterFactory.java:106)
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.16/Native Method)
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.16/NativeMethodAccessorImpl.java:62)
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.16/DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(java.base@11.0.16/Method.java:566)
        at org.apache.aries.blueprint.utils.ReflectionUtils.invoke(ReflectionUtils.java:337)
        at org.apache.aries.blueprint.container.BeanRecipe.invoke(BeanRecipe.java:835)
        at org.apache.aries.blueprint.container.BeanRecipe.destroy(BeanRecipe.java:742)
        ...
com.lmax.disruptor.WorkerPool.drainAndHalt doesn’t seem to have any kind of a timeout; it waits for the work pool to be completely drained. Looking at the workers, I see a whole bunch of these (TimeseriesWriter-Consumer-0 through TimeseriesWriter-Consumer-15):

"TimeseriesWriter-Consumer-0" #1558 prio=5 os_prio=0 cpu=19.06ms elapsed=990.90s tid=0x0000aaab056cf000 nid=0x6b5 sleeping [0x0000fffe2d7fd000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@11.0.16/Native Method)
        at org.opennms.core.soa.lookup.BlockingServiceLookup.lookup(BlockingServiceLookup.java:72)
        at org.opennms.netmgt.timeseries.TimeseriesStorageManagerImpl.get(TimeseriesStorageManagerImpl.java:67)
        at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:199)
        at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:75)
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:138)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
It looks like BlockingServiceLookup.lookup can block for 5 minutes waiting for a lookup, and it can do that every time RingBufferTimeseriesWriter.onEvent is called.

It seems that something in Karaf is causing org.opennms.features.org.opennms.features.timeseries to be reloaded. When the old one is destroyed, the destroy waits on com.lmax.disruptor.WorkerPool.drainAndHalt, which waits for the work pool to drain, and the workers are waiting in TimeseriesStorageManagerImpl.get and BlockingServiceLookup.lookup, which seems to be waiting for a TimeSeriesStorage lookup that never completes (probably because Karaf just destroyed it?).

I’ve been able to recreate this fairly readily by using onms-k8s-poc with Cortex and dual-write enabled and cranking the collectd and poller intervals way down, to interval="3000" (3 seconds).
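To make the per-call wait concrete, here is a conceptual sketch of a grace-period lookup loop. This is not the actual BlockingServiceLookup code, and the parameter names are made up, but it shows why every onEvent call can end up sleeping for the full grace period when no TimeSeriesStorage ever appears:

    import java.time.Duration;
    import java.util.function.Supplier;

    // Conceptual sketch only (not the real org.opennms.core.soa.lookup code):
    // poll the registry, sleeping between attempts, until the grace period ends.
    class GracePeriodLookup {
        static <T> T lookup(Supplier<T> registryLookup,
                            Duration gracePeriod,
                            Duration retryDelay) throws InterruptedException {
            long deadline = System.nanoTime() + gracePeriod.toNanos();
            while (System.nanoTime() < deadline) {
                T service = registryLookup.get();
                if (service != null) {
                    return service;
                }
                Thread.sleep(retryDelay.toMillis()); // the sleep seen in the stack trace above
            }
            return null; // grace period elapsed; the next onEvent call starts the wait again
        }
    }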