Improve robustness of CassandraBlobStore for async operations
Description
Acceptance / Success Criteria
Lucidchart Diagrams
Activity

Matthew Brooks October 2, 2019 at 9:20 PM
After thinking about this problem for a while I decided not to fix it. I don't think it is really a problem. The blobstore impl should not be responsible for adding error handling and retry logic; the user of the blobstore should be responsible for that. If we were to add retry logic it would make the Cassandra blobstore very complicated, and I suspect it would eventually force blocking behaviour onto the callers of what is supposed to be an asynchronous call.
Instead I updated the benchmark shell command to add its own retry logic for calls that fail due to overwhelming the connection pool. I also updated the Cassandra blobstore so that truncateContextAsync now does its deletes synchronously to avoid exceeding the connection pool's available connections.
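A minimal sketch of what that kind of caller-side retry could look like, assuming a hypothetical helper class and a placeholder isPoolExhausted() check; this is not the actual benchmark command code:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Hypothetical helper mirroring the idea of retrying an async call a bounded
// number of times when it fails because the connection pool was exhausted.
public final class BenchmarkRetry {

    public static <T> T callWithRetry(Supplier<CompletableFuture<T>> call, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                // join() surfaces the async failure as a CompletionException.
                return call.get().join();
            } catch (CompletionException e) {
                if (attempt >= maxAttempts || !isPoolExhausted(e.getCause())) {
                    throw e;
                }
                // Otherwise fall through and retry the call.
            }
        }
    }

    private static boolean isPoolExhausted(Throwable cause) {
        // Placeholder: the real check would inspect the driver exception that
        // indicates no connection was available for the request.
        return cause != null && cause.getMessage() != null
                && cause.getMessage().contains("busy");
    }
}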

Matthew Brooks October 2, 2019 at 9:16 PM
On second look, it appears the times are probably correct and not a bug. It seems the dramatic increase in latency is just the nature of how the async processing works when you hammer it with requests.

Matthew Brooks September 11, 2019 at 1:46 PM
While we're in here, it looks like there is a bug in the benchmark tool when running async as well. The reported mean request latency is probably off in async mode, since it is so much higher than in sync mode:

Matthew Brooks September 10, 2019 at 9:48 PM
Example of exception:
Details
Assignee: Matthew Brooks
Reporter: Matthew Brooks
Sprint: None
Fix versions: None
Priority: Major

Currently the CassandraBlobStore can overwhelm the Cassandra cluster with async requests if too many requests are in flight relative to the number of async connections the cluster allows (apparently 250 by default for a single-node cluster).
When this happens, inspecting the result future throws an exception indicating the operation was not processed.
To avoid this, we could add logic to the CassandraBlobStore that only allows a certain number of requests to be in flight at once.
The main situation where this is problematic is managing thresholding states, specifically when clearing all thresholding states, since we attempt an async delete on each state (and there may be many thousands).
We should probably use a global gate of some sort (resilience4j?) that only allows X in-flight async requests at once, regardless of operation, to ensure we never overwhelm the Cassandra connection pool (this number should be configurable); see the sketch below.
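A minimal sketch of such a gate, using a plain java.util.concurrent.Semaphore as a stand-in for something like resilience4j's Bulkhead; the class name and wiring are hypothetical, and note that acquiring a permit blocks the submitting thread once the limit is reached, which is the blocking trade-off mentioned in the comments:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical global gate capping the number of in-flight async requests.
// maxInFlight would come from configuration; the blobstore would route every
// async Cassandra call through execute().
public class AsyncRequestGate {
    private final Semaphore permits;

    public AsyncRequestGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    public <T> CompletableFuture<T> execute(Supplier<CompletableFuture<T>> asyncCall) {
        // Block the submitting thread until a permit is free so the number of
        // concurrent requests never exceeds the configured limit.
        permits.acquireUninterruptibly();
        CompletableFuture<T> future;
        try {
            future = asyncCall.get();
        } catch (RuntimeException e) {
            permits.release();
            throw e;
        }
        // Release the permit when the request completes, successfully or not.
        return future.whenComplete((result, error) -> permits.release());
    }
}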
The problem can be reproduced using the benchmark command with an appropriately large number of async requests: