Thresholding blocks threads under load

Description

After upgrading from OpenNMS 29.0.5 to 30.0.1, we noticed that its normal operating behavior has changed under our load.

We have seen a couple of instances where a number of threads become blocked and the system becomes backlogged with work.

We were able to catch a stack trace during one of the incidents and found the thread held up in the synchronized removeAll call in the Newts CassandraIndexer.

 

"NewtsWriter-Consumer-27" #406 prio=5 os_prio=0 cpu=11128514.87ms elapsed=268963.26s tid=0x00007fa5268f8000 nid=0x44ead runnable  [0x00007fa1b8ead000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap$KeySet.iterator(java.base@11.0.14.1/HashMap.java:913)
        at java.util.HashSet.iterator(java.base@11.0.14.1/HashSet.java:173)
        at java.util.AbstractSet.removeAll(java.base@11.0.14.1/AbstractSet.java:174)
        at org.opennms.newts.cassandra.search.CassandraIndexer.update(CassandraIndexer.java:144)
        - locked <0x00007fa900f4a4a0> (a java.util.HashSet)
        at org.opennms.newts.cassandra.search.CassandraIndexerSampleProcessor.submit(CassandraIndexerSampleProcessor.java:45)
        at org.opennms.netmgt.newts.support.SimpleSampleProcessorService.lambda$submit$0(SimpleSampleProcessorService.java:63)
        at org.opennms.netmgt.newts.support.SimpleSampleProcessorService$$Lambda$1961/0x00007f9bf5f6d8b0.accept(Unknown Source)
        at java.util.HashMap$KeySpliterator.forEachRemaining(java.base@11.0.14.1/HashMap.java:1621)
        at java.util.stream.ReferencePipeline$Head.forEach(java.base@11.0.14.1/ReferencePipeline.java:658)
        at org.opennms.netmgt.newts.support.SimpleSampleProcessorService.submit(SimpleSampleProcessorService.java:63)
        at org.opennms.newts.persistence.cassandra.CassandraSampleRepository.insert(CassandraSampleRepository.java:279)
        at org.opennms.newts.persistence.cassandra.CassandraSampleRepository.insert(CassandraSampleRepository.java:232)
        at org.opennms.netmgt.newts.NewtsWriter.onEvent(NewtsWriter.java:220)
        at org.opennms.netmgt.newts.NewtsWriter.onEvent(NewtsWriter.java:75)
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:138)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
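
For illustration only, here is a minimal Java sketch of the pattern the trace points at, using hypothetical names (SharedIndexCache is not the actual CassandraIndexer code): a single HashSet shared by all NewtsWriter consumers is pruned with removeAll inside a synchronized block, so one slow prune serializes every other writer behind the same monitor.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal sketch of the contention pattern visible in the stack trace
 * (hypothetical names; not the actual CassandraIndexer implementation).
 * A single shared HashSet acts as a "recently indexed" cache, and every
 * writer thread must enter the same synchronized block to prune it.
 */
public class SharedIndexCache {

    // One cache instance shared by all NewtsWriter consumer threads.
    private final Set<String> recentlyIndexed = new HashSet<>();

    public void update(Collection<String> staleKeys, Collection<String> newKeys) {
        // All writer threads serialize here. The thread inside this block is
        // the one showing "- locked <...> (a java.util.HashSet)" in the trace;
        // every other consumer waits for the monitor.
        synchronized (recentlyIndexed) {
            // AbstractSet.removeAll iterates one collection and calls
            // contains()/remove() on the other, so with large inputs this
            // can take long enough to back up the whole writer pool.
            recentlyIndexed.removeAll(staleKeys);
            recentlyIndexed.addAll(newKeys);
        }
    }
}

In the trace above, the RUNNABLE thread is the one currently holding the monitor while AbstractSet.removeAll iterates the set; any other consumer entering the same critical section would appear as BLOCKED on that lock.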

Acceptance / Success Criteria

None

Attachments

1

Activity

fooker August 24, 2022 at 7:45 AM

fooker August 22, 2022 at 12:05 PM

The thresholding is locked because something triggered a reload of the thresholding configuration:

All active thresholding threads are waiting on this lock.
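
For illustration, a minimal sketch of that interaction, assuming the reload path and the evaluation path share a single lock (ThresholdingState and its methods are hypothetical names, not the actual OpenNMS thresholding classes):

import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Minimal sketch of the reload/evaluate interaction described above
 * (hypothetical names; not the actual OpenNMS thresholding code).
 * A configuration reload holds an exclusive lock while it rebuilds state,
 * so every thread that evaluates thresholds blocks until it finishes.
 */
public class ThresholdingState {

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called when a reload event (e.g. a config change) is received. */
    public void reload() {
        lock.writeLock().lock();
        try {
            // Rebuilding the configuration can be slow; while it runs,
            // all evaluators below are parked waiting on this lock.
            rebuildConfiguration();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Called by every thresholding thread for each sample. */
    public void evaluate(double value) {
        lock.readLock().lock();
        try {
            applyThresholds(value);
        } finally {
            lock.readLock().unlock();
        }
    }

    private void rebuildConfiguration() { /* load and parse config */ }

    private void applyThresholds(double value) { /* compare against thresholds */ }
}

If the reload itself is slow, or waits on something else such as the backlogged Newts writer above, every evaluate() caller piles up on the lock, which would match the observation that all active thresholding threads are waiting on it.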

Jesse White August 12, 2022 at 2:18 PM

Do we see this same lock contention and stack trace repeatedly, or is this just one data point?

The fact that the threads were here at some point in time doesn't necessarily imply a problem - this critical section could be quick and the real problem may be elsewhere.

Sean Torres August 12, 2022 at 12:28 AM

Stack attached

Benjamin Reed August 11, 2022 at 7:49 PM

Anyway, this will take deeper investigation than the quick look I was hoping it would be. I'm gonna un-assign this from myself and let it get into the sprint properly, since I'm afraid I can't give it my full attention.

Fixed

Details

Docs Needed: No

Created: August 11, 2022 at 2:05 PM
Updated: September 8, 2022 at 4:45 PM
Resolved: August 25, 2022 at 12:01 PM