Improve robustness of CassandraBlobStore for async operations
Description
Acceptance / Success Criteria
Lucidchart Diagrams
Activity

Matthew Brooks October 2, 2019 at 9:20 PM
After thinking about this problem for a while I decided not to fix it. I don't think it is really a problem. The blobstore impl should not be responsible for adding error handling and retry logic; the user of the blobstore should be responsible for that. If we were to add retry logic it would make the Cassandra blobstore very complicated, and I suspect it would eventually force blocking behaviour onto the callers of what is supposed to be an asynchronous call.
Instead I updated the benchmark shell command to add its own retry logic for calls that fail due to overwhelming the connection pool. I also updated the Cassandra blobstore so that truncateContextAsync now does its deletes synchronously to avoid exceeding the connection pool's available connections.
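A minimal sketch of what that kind of caller-side retry could look like, assuming a hypothetical helper class and a placeholder isPoolExhausted() check; this is not the actual benchmark command code:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Hypothetical helper mirroring the idea of retrying an async call a bounded
// number of times when it fails because the connection pool was exhausted.
public final class BenchmarkRetry {

    public static <T> T callWithRetry(Supplier<CompletableFuture<T>> call, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                // join() surfaces the async failure as a CompletionException.
                return call.get().join();
            } catch (CompletionException e) {
                if (attempt >= maxAttempts || !isPoolExhausted(e.getCause())) {
                    throw e;
                }
                // Otherwise fall through and retry the call.
            }
        }
    }

    private static boolean isPoolExhausted(Throwable cause) {
        // Placeholder: the real check would inspect the driver exception that
        // indicates no connection was available for the request.
        return cause != null && cause.getMessage() != null
                && cause.getMessage().contains("busy");
    }
}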

Matthew Brooks October 2, 2019 at 9:16 PM
On second look, it appears the times are probably correct and not a bug. It seems the dramatic increase in latency is just the nature of how the async processing works when you hammer it with requests.

Matthew Brooks September 11, 2019 at 1:46 PM
While we're in here, it looks like there is a bug in the benchmark tool when running async as well. The reported mean request latency is probably off in async mode, since it is so much higher than in sync mode:

Matthew Brooks September 10, 2019 at 9:48 PM
Example of exception:
Details
Assignee: Matthew Brooks
Reporter: Matthew Brooks
Sprint: None
Fix versions: None
Priority: Major

Currently the CassandraBlobStore can overwhelm the Cassandra cluster with async requests if too many requests are in flight relative to the number of async connections the cluster allows (apparently 250 by default for a single-node cluster).
When this happens, inspecting the result future throws an exception indicating the operation was not processed.
To avoid this, we could add logic to the CassandraBlobStore that only allows a certain number of requests to be in flight at once.
The main situation where this is problematic is managing thresholding states, specifically when clearing all thresholding states, since we attempt an async delete on each state (and there may be many thousands).
We should probably use a global gate of some sort (resilience4j?) that only allows X in-flight async requests at once, regardless of operation, to ensure we never overwhelm the Cassandra connection pool (this number should be configurable); see the sketch below.
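A minimal sketch of such a gate, using a plain java.util.concurrent.Semaphore as a stand-in for something like resilience4j's Bulkhead; the class name and wiring are hypothetical, and note that acquiring a permit blocks the submitting thread once the limit is reached, which is the blocking trade-off mentioned in the comments:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical global gate capping the number of in-flight async requests.
// maxInFlight would come from configuration; the blobstore would route every
// async Cassandra call through execute().
public class AsyncRequestGate {
    private final Semaphore permits;

    public AsyncRequestGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    public <T> CompletableFuture<T> execute(Supplier<CompletableFuture<T>> asyncCall) {
        // Block the submitting thread until a permit is free so the number of
        // concurrent requests never exceeds the configured limit.
        permits.acquireUninterruptibly();
        CompletableFuture<T> future;
        try {
            future = asyncCall.get();
        } catch (RuntimeException e) {
            permits.release();
            throw e;
        }
        // Release the permit when the request completes, successfully or not.
        return future.whenComplete((result, error) -> permits.release());
    }
}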
The problem can be reproduced using the benchmark command with an appropriately large number of async requests: