RingBufferTimeseriesWriter.destroy can take a long time or hang due to BlockingServiceLookup.lookup in WorkProcessors

Description

In certain cases, RingBufferTimeseriesWriter.destroy can hang for a while (5+ minutes, or indefinitely) because of BlockingServiceLookup.lookup calls in the WorkProcessors. I believe the following conditions cause it:

  1. The opennms-timeseries-api feature is enabled.

  2. The org.opennms.features.org.opennms.features.timeseries container in Karaf is being destroyed (this seems to happen when OpenNMS is refreshing that container).

  3. There is data in the ring buffer and one or more WorkProcessors are trying to work on those samples.

  4. No TimeSeriesStorage implementation is available.

For #2, I see this in karaf.log:

2023-01-21T16:12:14,062 | INFO | FelixFrameworkWiring | BlueprintExtender | 29 - org.apache.aries.blueprint.core - 1.10.3 | Destroying container for blueprint bundle org.opennms.features.org.opennms.features.timeseries/31.0.2

Looking at a stack trace, I can see destroy waiting in com.lmax.disruptor.WorkerPool.drainAndHalt:

"FelixFrameworkWiring" #393 daemon prio=5 os_prio=0 cpu=987666.87ms elapsed=1007.34s tid=0x0000fffe38a47800 nid=0x229 runnable [0x0000fffe3cffa000] java.lang.Thread.State: RUNNABLE at java.lang.Thread.yield(java.base@11.0.16/Native Method) at com.lmax.disruptor.WorkerPool.drainAndHalt(WorkerPool.java:148) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.destroy(RingBufferTimeseriesWriter.java:162) at org.opennms.netmgt.timeseries.samplewrite.TimeseriesPersisterFactory.destroy(TimeseriesPersisterFactory.java:106) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.16/Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.16/NativeMethodAccessorImpl.java:62) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.16/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.16/Method.java:566) at org.apache.aries.blueprint.utils.ReflectionUtils.invoke(ReflectionUtils.java:337) at org.apache.aries.blueprint.container.BeanRecipe.invoke(BeanRecipe.java:835) at org.apache.aries.blueprint.container.BeanRecipe.destroy(BeanRecipe.java:742) ...

com.lmax.disruptor.WorkerPool.drainAndHalt doesn't appear to have any kind of timeout; it waits until the work pool is completely drained. Looking at the workers, I see a whole bunch of these (TimeseriesWriter-Consumer-0 through TimeseriesWriter-Consumer-15):

"TimeseriesWriter-Consumer-0" #1558 prio=5 os_prio=0 cpu=19.06ms elapsed=990.90s tid=0x0000aaab056cf000 nid=0x6b5 sleeping [0x0000fffe2d7fd000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(java.base@11.0.16/Native Method) at org.opennms.core.soa.lookup.BlockingServiceLookup.lookup(BlockingServiceLookup.java:72) at org.opennms.netmgt.timeseries.TimeseriesStorageManagerImpl.get(TimeseriesStorageManagerImpl.java:67) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:199) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:75) at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:138) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

It looks like BlockingServiceLookup.lookup can block for up to 5 minutes waiting for a service to appear, and it can do that on every call to RingBufferTimeseriesWriter.onEvent.
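I haven't dug into the exact implementation, but the behavior matches a grace-period polling loop roughly like the following (a hypothetical sketch, not the actual org.opennms.core.soa.lookup code); the key point is that every call can burn the full grace period when no TimeSeriesStorage is registered:

import java.util.function.Supplier;

// Hypothetical illustration of a blocking lookup with a grace period. The
// real BlockingServiceLookup may differ in detail; this only shows why each
// call can stall the consumer thread for minutes.
public class GracePeriodLookup<T> {

    private final Supplier<T> registryLookup;   // non-blocking lookup against the service registry
    private final long gracePeriodMs;           // e.g. 5 minutes
    private final long sleepTimeMs;             // polling interval

    public GracePeriodLookup(Supplier<T> registryLookup, long gracePeriodMs, long sleepTimeMs) {
        this.registryLookup = registryLookup;
        this.gracePeriodMs = gracePeriodMs;
        this.sleepTimeMs = sleepTimeMs;
    }

    public T lookup() throws InterruptedException {
        long deadline = System.currentTimeMillis() + gracePeriodMs;
        while (true) {
            T service = registryLookup.get();
            if (service != null) {
                return service;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null; // gave up after the grace period
            }
            Thread.sleep(sleepTimeMs); // this is where the TimeseriesWriter-Consumer-* threads sit
        }
    }
}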

It seems that something in Karaf is causing org.opennms.features.org.opennms.features.timeseries to be reloaded. When the old bundle is destroyed, its destroy blocks in com.lmax.disruptor.WorkerPool.drainAndHalt, which waits for the work pool to drain. The workers, in turn, are blocked in TimeseriesStorageManagerImpl.get and BlockingServiceLookup.lookup, waiting for a TimeSeriesStorage lookup that never completes (probably because Karaf just destroyed that service?).
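For illustration, here is a minimal, self-contained sketch of the mechanism using the LMAX Disruptor 3.x WorkerPool API directly (this is not the OpenNMS wiring; the event type and handler are stand-ins): drainAndHalt() cannot return while a published entry is stuck in a blocking handler.

import com.lmax.disruptor.FatalExceptionHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.WorkerPool;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DrainAndHaltDemo {

    // Stand-in for the sample batch events the writer publishes.
    static class SampleEvent { }

    public static void main(String[] args) throws Exception {
        // Handler that blocks, standing in for RingBufferTimeseriesWriter.onEvent
        // when TimeseriesStorageManagerImpl.get() ends up in BlockingServiceLookup.lookup.
        WorkHandler<SampleEvent> blockingHandler = event ->
                Thread.sleep(TimeUnit.MINUTES.toMillis(5));

        // Convenience constructor that builds an internal ring buffer.
        WorkerPool<SampleEvent> pool = new WorkerPool<SampleEvent>(
                SampleEvent::new, new FatalExceptionHandler(), blockingHandler);

        ExecutorService executor = Executors.newCachedThreadPool();
        RingBuffer<SampleEvent> ringBuffer = pool.start(executor);

        // Publish a single entry so the pool has work in flight.
        long seq = ringBuffer.next();
        ringBuffer.publish(seq);
        TimeUnit.SECONDS.sleep(1); // give a worker time to pick it up

        // drainAndHalt() has no timeout: it spins on Thread.yield() until every
        // published sequence has been consumed, so this call blocks for as long
        // as the handler does (and indefinitely if new entries keep arriving
        // faster than they can be consumed).
        pool.drainAndHalt();
        executor.shutdown();
    }
}

Running this, the main thread sits in drainAndHalt() for as long as the handler sleeps, which matches the FelixFrameworkWiring stack trace above.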

I’ve been able to recreate this fairly readily by using onms-k8s-poc with Cortex and dual write enabled, and cranking the collectd and poller intervals down very low: interval="3000" (3 seconds).

Acceptance / Success Criteria

None

Activity

DJ Gregor February 3, 2023 at 3:37 AM
Edited

Found an issue where the force shutdown wasn’t working as expected: once we have decided to force a shutdown, LMAX's drainAndHalt still waits until draining is complete.
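For context, WorkerPool only offers two shutdown calls, and neither interrupts a handler that is already blocked inside onEvent (illustrative summary, not the OpenNMS code):

import com.lmax.disruptor.WorkerPool;

// Illustrative contrast of the two shutdown calls WorkerPool offers; this is
// only a summary of the LMAX behavior involved here.
class WorkerPoolShutdownNotes {

    // Blocks until every published entry has been consumed. There is no
    // timeout, so a handler stuck in a multi-minute lookup holds this up.
    static void drainingShutdown(WorkerPool<?> pool) {
        pool.drainAndHalt();
    }

    // Asks each WorkProcessor to stop after its current event. It does not
    // interrupt a handler already blocked inside onEvent, which is why the
    // fix also has onEvent itself give up once a forced shutdown is requested.
    static void immediateShutdown(WorkerPool<?> pool) {
        pool.halt();
    }
}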

The PR in https://github.com/OpenNMS/opennms/pull/5767 has onEvent throw away metrics at this point.
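I won't reproduce the PR's diff here, but the general shape of "throw away metrics once a forced shutdown has been requested" is something like this (field and method names are hypothetical):

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the discard-on-forced-shutdown approach; the actual
// change is in the PR above and may differ in detail.
class ForcedShutdownSketch {

    private final AtomicBoolean forceShutdown = new AtomicBoolean(false);

    // Called from destroy() once the shutdown timeout has expired.
    void requestForceShutdown() {
        forceShutdown.set(true);
    }

    // Simplified stand-in for RingBufferTimeseriesWriter.onEvent.
    void onEvent(Object sampleBatchEvent) throws Exception {
        if (forceShutdown.get()) {
            // Throw the samples away instead of blocking in the service lookup,
            // so drainAndHalt() can finish emptying the ring buffer.
            return;
        }
        // ... normal path: resolve the TimeSeriesStorage and persist the samples ...
    }
}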

I should note that I found this issue while working on https://opennms.atlassian.net/browse/NMS-15351. You can see the example where I found it in the description of https://github.com/OpenNMS/opennms/pull/5758. The "... snip ..." part, where it was doing Installing features: [opennms-plugins-cortex-tss] (and, behind the scenes, waiting to destroy opennms-timeseries-api), was about 2-3 minutes long, which is what made me realize that the 30-second shutdown timeout wasn't working.

DJ Gregor January 23, 2023 at 12:15 AM

One thing I noticed: when we start to drain the worker pool, we don’t do anything to prevent new work from coming in. This can lead to the destroy hanging indefinitely. I added a stats logging thread when destroy is called, and this is what I see:

2023-01-22T19:09:16,533 | INFO | FelixFrameworkWiring | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | Draining and halting the time series worker pool. Entries in ring buffer: 0
2023-01-22T19:09:16,535 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:21,538 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:26,540 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:31,541 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:36,542 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:41,543 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 28
2023-01-22T19:09:46,544 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 31
2023-01-22T19:09:51,546 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:09:56,549 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:01,554 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:06,556 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:11,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:16,558 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 77
2023-01-22T19:10:21,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
2023-01-22T19:10:26,559 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
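One way to stop new work from being published once destroy() has started would be a producer-side guard along these lines (illustrative only; names are hypothetical and this is not what the code does today):

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative producer-side guard: once destroy() begins, refuse new samples
// so the drain can actually converge.
class GatedPublisher {

    private final AtomicBoolean destroying = new AtomicBoolean(false);

    void destroy() {
        destroying.set(true);
        // ... then drain/halt the worker pool ...
    }

    // Returns false (dropping or diverting the sample) while shutting down;
    // otherwise runs the normal ringBuffer.next()/publish() path.
    boolean tryPublish(Runnable publishToRingBuffer) {
        if (destroying.get()) {
            return false;
        }
        publishToRingBuffer.run();
        return true;
    }
}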

Also, we decrement the entry counter at the top of onEvent, not at the end, so the stats shown are a little misleading: they do not include the work in progress while we are blocked in onEvent, which is why it can look like the shutdown should have finished already.
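Moving the decrement into a finally block would make the counter reflect in-flight work; a rough sketch (assuming a simple counter field, not the actual metrics wiring):

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: count an entry as "in the ring buffer" until onEvent has
// actually finished with it, so the destroy() status log is not misleading.
class EntryCountingHandler {

    private final AtomicLong entriesInRingBuffer = new AtomicLong();

    void onEvent(Object event) throws Exception {
        try {
            process(event); // may block for minutes in the service lookup
        } finally {
            // Decrement at the end, not at the top, so in-flight work still counts.
            entriesInRingBuffer.decrementAndGet();
        }
    }

    private void process(Object event) { /* resolve storage and persist samples */ }
}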

Fixed

Details

Created January 22, 2023 at 11:57 PM
Updated February 6, 2023 at 2:26 AM
Resolved February 3, 2023 at 6:39 PM
