RingBufferTimeseriesWriter.destroy can take a long time or hang due to BlockingServiceLookup.lookup in WorkProcessors

Description

In certain cases, RingBufferTimeseriesWriter.destroy can hang for a while (5+ minutes, or indefinitely) because of BlockingServiceLookup.lookup calls in the WorkProcessors. I believe the following conditions cause it:

  1. The opennms-timeseries-api feature is enabled.

  2. The org.opennms.features.org.opennms.features.timeseries container in Karaf is being destroyed (this seems to happen when OpenNMS is refreshing that container).

  3. There is data in the ring buffer and one or more WorkProcessors are trying to work on those samples.

  4. No TimeSeriesStorage implementation is available.

For #2, I see this in karaf.log:

2023-01-21T16:12:14,062 | INFO | FelixFrameworkWiring | BlueprintExtender | 29 - org.apache.aries.blueprint.core - 1.10.3 | Destroying container for blueprint bundle org.opennms.features.org.opennms.features.timeseries/31.0.2

Looking at a stack trace, I can see destroy waiting in com.lmax.disruptor.WorkerPool.drainAndHalt:

"FelixFrameworkWiring" #393 daemon prio=5 os_prio=0 cpu=987666.87ms elapsed=1007.34s tid=0x0000fffe38a47800 nid=0x229 runnable [0x0000fffe3cffa000] java.lang.Thread.State: RUNNABLE at java.lang.Thread.yield(java.base@11.0.16/Native Method) at com.lmax.disruptor.WorkerPool.drainAndHalt(WorkerPool.java:148) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.destroy(RingBufferTimeseriesWriter.java:162) at org.opennms.netmgt.timeseries.samplewrite.TimeseriesPersisterFactory.destroy(TimeseriesPersisterFactory.java:106) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.16/Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.16/NativeMethodAccessorImpl.java:62) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.16/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.16/Method.java:566) at org.apache.aries.blueprint.utils.ReflectionUtils.invoke(ReflectionUtils.java:337) at org.apache.aries.blueprint.container.BeanRecipe.invoke(BeanRecipe.java:835) at org.apache.aries.blueprint.container.BeanRecipe.destroy(BeanRecipe.java:742) ...

com.lmax.disruptor.WorkerPool.drainAndHalt doesn't appear to have any kind of timeout; it waits until the work pool is completely drained. Looking at the workers, I see a whole bunch of these (TimeseriesWriter-Consumer-0 through TimeseriesWriter-Consumer-15):

"TimeseriesWriter-Consumer-0" #1558 prio=5 os_prio=0 cpu=19.06ms elapsed=990.90s tid=0x0000aaab056cf000 nid=0x6b5 sleeping [0x0000fffe2d7fd000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(java.base@11.0.16/Native Method) at org.opennms.core.soa.lookup.BlockingServiceLookup.lookup(BlockingServiceLookup.java:72) at org.opennms.netmgt.timeseries.TimeseriesStorageManagerImpl.get(TimeseriesStorageManagerImpl.java:67) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:199) at org.opennms.netmgt.timeseries.samplewrite.RingBufferTimeseriesWriter.onEvent(RingBufferTimeseriesWriter.java:75) at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:138) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

It looks like BlockingServiceLookup.lookup can block for up to 5 minutes waiting for a service to appear, and it can do that on every call to RingBufferTimeseriesWriter.onEvent.
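I haven't dug into the exact implementation, but the behavior matches a grace-period polling loop roughly like the following (a hypothetical sketch, not the actual org.opennms.core.soa.lookup code); the key point is that every call can burn the full grace period when no TimeSeriesStorage is registered:

import java.util.function.Supplier;

// Hypothetical illustration of a blocking lookup with a grace period. The
// real BlockingServiceLookup may differ in detail; this only shows why each
// call can stall the consumer thread for minutes.
public class GracePeriodLookup<T> {

    private final Supplier<T> registryLookup;   // non-blocking lookup against the service registry
    private final long gracePeriodMs;           // e.g. 5 minutes
    private final long sleepTimeMs;             // polling interval

    public GracePeriodLookup(Supplier<T> registryLookup, long gracePeriodMs, long sleepTimeMs) {
        this.registryLookup = registryLookup;
        this.gracePeriodMs = gracePeriodMs;
        this.sleepTimeMs = sleepTimeMs;
    }

    public T lookup() throws InterruptedException {
        long deadline = System.currentTimeMillis() + gracePeriodMs;
        while (true) {
            T service = registryLookup.get();
            if (service != null) {
                return service;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null; // gave up after the grace period
            }
            Thread.sleep(sleepTimeMs); // this is where the TimeseriesWriter-Consumer-* threads sit
        }
    }
}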

It seems that something in Karaf is causing org.opennms.features.org.opennms.features.timeseries to be reloaded. When the old bundle is destroyed, its destroy blocks in com.lmax.disruptor.WorkerPool.drainAndHalt, which waits for the work pool to drain. The workers, in turn, are blocked in TimeseriesStorageManagerImpl.get and BlockingServiceLookup.lookup, waiting for a TimeSeriesStorage lookup that never completes (probably because Karaf just destroyed that service?).
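For illustration, here is a minimal, self-contained sketch of the mechanism using the LMAX Disruptor 3.x WorkerPool API directly (this is not the OpenNMS wiring; the event type and handler are stand-ins): drainAndHalt() cannot return while a published entry is stuck in a blocking handler.

import com.lmax.disruptor.FatalExceptionHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.WorkerPool;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DrainAndHaltDemo {

    // Stand-in for the sample batch events the writer publishes.
    static class SampleEvent { }

    public static void main(String[] args) throws Exception {
        // Handler that blocks, standing in for RingBufferTimeseriesWriter.onEvent
        // when TimeseriesStorageManagerImpl.get() ends up in BlockingServiceLookup.lookup.
        WorkHandler<SampleEvent> blockingHandler = event ->
                Thread.sleep(TimeUnit.MINUTES.toMillis(5));

        // Convenience constructor that builds an internal ring buffer.
        WorkerPool<SampleEvent> pool = new WorkerPool<SampleEvent>(
                SampleEvent::new, new FatalExceptionHandler(), blockingHandler);

        ExecutorService executor = Executors.newCachedThreadPool();
        RingBuffer<SampleEvent> ringBuffer = pool.start(executor);

        // Publish a single entry so the pool has work in flight.
        long seq = ringBuffer.next();
        ringBuffer.publish(seq);
        TimeUnit.SECONDS.sleep(1); // give a worker time to pick it up

        // drainAndHalt() has no timeout: it spins on Thread.yield() until every
        // published sequence has been consumed, so this call blocks for as long
        // as the handler does (and indefinitely if new entries keep arriving
        // faster than they can be consumed).
        pool.drainAndHalt();
        executor.shutdown();
    }
}

Running this, the main thread sits in drainAndHalt() for as long as the handler sleeps, which matches the FelixFrameworkWiring stack trace above.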

I’ve been able to recreate this fairly readily by using onms-k8s-poc with Cortex and dual write enabled, and cranking the collectd and poller intervals down very low: interval="3000" (3 seconds).

Acceptance / Success Criteria

None

Activity

DJ Gregor February 3, 2023 at 3:37 AM
Edited

Found an issue where the force shutdown wasn’t working as expected: once we have decided to force a shutdown, LMAX's drainAndHalt still waits until draining is complete.
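For context, WorkerPool only offers two shutdown calls, and neither interrupts a handler that is already blocked inside onEvent (illustrative summary, not the OpenNMS code):

import com.lmax.disruptor.WorkerPool;

// Illustrative contrast of the two shutdown calls WorkerPool offers; this is
// only a summary of the LMAX behavior involved here.
class WorkerPoolShutdownNotes {

    // Blocks until every published entry has been consumed. There is no
    // timeout, so a handler stuck in a multi-minute lookup holds this up.
    static void drainingShutdown(WorkerPool<?> pool) {
        pool.drainAndHalt();
    }

    // Asks each WorkProcessor to stop after its current event. It does not
    // interrupt a handler already blocked inside onEvent, which is why the
    // fix also has onEvent itself give up once a forced shutdown is requested.
    static void immediateShutdown(WorkerPool<?> pool) {
        pool.halt();
    }
}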

The PR in https://github.com/OpenNMS/opennms/pull/5767 has onEvent throw away metrics at this point.
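I won't reproduce the PR's diff here, but the general shape of "throw away metrics once a forced shutdown has been requested" is something like this (field and method names are hypothetical):

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the discard-on-forced-shutdown approach; the actual
// change is in the PR above and may differ in detail.
class ForcedShutdownSketch {

    private final AtomicBoolean forceShutdown = new AtomicBoolean(false);

    // Called from destroy() once the shutdown timeout has expired.
    void requestForceShutdown() {
        forceShutdown.set(true);
    }

    // Simplified stand-in for RingBufferTimeseriesWriter.onEvent.
    void onEvent(Object sampleBatchEvent) throws Exception {
        if (forceShutdown.get()) {
            // Throw the samples away instead of blocking in the service lookup,
            // so drainAndHalt() can finish emptying the ring buffer.
            return;
        }
        // ... normal path: resolve the TimeSeriesStorage and persist the samples ...
    }
}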

I should note that I found this issue while working on https://opennms.atlassian.net/browse/NMS-15351. You can see the example where I found it in the description of https://github.com/OpenNMS/opennms/pull/5758. The "... snip ..." part, where it was doing Installing features: [opennms-plugins-cortex-tss] (and, behind the scenes, waiting to destroy opennms-timeseries-api), was about 2-3 minutes long, which is what made me realize that the 30-second shutdown timeout wasn't working.

DJ Gregor January 23, 2023 at 12:15 AM

One thing I noticed: when we start to drain the worker pool, we don’t do anything to prevent new work from coming in. This can lead to the destroy hanging indefinitely. I added a stats logging thread when destroy is called, and this is what I see:

2023-01-22T19:09:16,533 | INFO | FelixFrameworkWiring | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | Draining and halting the time series worker pool. Entries in ring buffer: 0
2023-01-22T19:09:16,535 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:21,538 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:26,540 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:31,541 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:36,542 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 0
2023-01-22T19:09:41,543 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 28
2023-01-22T19:09:46,544 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 31
2023-01-22T19:09:51,546 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:09:56,549 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:01,554 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:06,556 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:11,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 32
2023-01-22T19:10:16,558 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 77
2023-01-22T19:10:21,557 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
2023-01-22T19:10:26,559 | INFO | RingBufferTimeseriesWriter-destroy-status | RingBufferTimeseriesWriter | 478 - org.opennms.features.org.opennms.features.timeseries - 32.0.0.SNAPSHOT | destroy() in progress. Entries left in ring buffer to drain: 78
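One way to stop new work from being published once destroy() has started would be a producer-side guard along these lines (illustrative only; names are hypothetical and this is not what the code does today):

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative producer-side guard: once destroy() begins, refuse new samples
// so the drain can actually converge.
class GatedPublisher {

    private final AtomicBoolean destroying = new AtomicBoolean(false);

    void destroy() {
        destroying.set(true);
        // ... then drain/halt the worker pool ...
    }

    // Returns false (dropping or diverting the sample) while shutting down;
    // otherwise runs the normal ringBuffer.next()/publish() path.
    boolean tryPublish(Runnable publishToRingBuffer) {
        if (destroying.get()) {
            return false;
        }
        publishToRingBuffer.run();
        return true;
    }
}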

Also, we decrement the entry counter at the top of onEvent, not at the end, so the stats shown are a little misleading: they do not include the work in progress while we are blocked in onEvent, which is why it can look like the shutdown should have finished already.
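Moving the decrement into a finally block would make the counter reflect in-flight work; a rough sketch (assuming a simple counter field, not the actual metrics wiring):

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: count an entry as "in the ring buffer" until onEvent has
// actually finished with it, so the destroy() status log is not misleading.
class EntryCountingHandler {

    private final AtomicLong entriesInRingBuffer = new AtomicLong();

    void onEvent(Object event) throws Exception {
        try {
            process(event); // may block for minutes in the service lookup
        } finally {
            // Decrement at the end, not at the top, so in-flight work still counts.
            entriesInRingBuffer.decrementAndGet();
        }
    }

    private void process(Object event) { /* resolve storage and persist samples */ }
}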

Fixed

Details

Created January 22, 2023 at 11:57 PM
Updated February 6, 2023 at 2:26 AM
Resolved February 3, 2023 at 6:39 PM
