Extremely intermittent failure to detect services with TcpDetector in unit tests

Description

While running AsyncDetectorFileDescriptorLeakTest, I (very rarely) get a false negative: TcpDetector fails to detect an open TCP port. When this failure occurs, the sequence of log messages differs slightly from that of successful detections. That may be the cause of the problem, but I'm not sure.

Acceptance / Success Criteria

None

Activity

Seth Leger August 21, 2012 at 5:27 PM

The problem here was that the READER_IDLE timeout in the detector code defaulted to 1 second, which is probably too low. I raised the default to 3 seconds and refactored the code so that the timeout is always specified in milliseconds. This should prevent the IDLE timeouts that we're seeing on Bamboo. Marking as fixed.

commit e3de2b7e4af8d30933a4e71ec0475aa447397286
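For illustration only, here is a minimal sketch of what "always specified as a millisecond value" with a 3-second default could look like. The class name `IdleTimeoutConfig`, the constant, and the method are hypothetical and are not taken from the commit; falling back to the default when no value is configured is also an assumption.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the actual detector code.
public class IdleTimeoutConfig {
    // Default idle timeout of 3 seconds, stored in milliseconds.
    public static final long DEFAULT_IDLE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(3);

    // Normalizes a configured timeout to milliseconds.
    // Assumption: a missing (null) value means "use the default".
    public static long idleTimeoutMillis(Long configured, TimeUnit unit) {
        if (configured == null) {
            return DEFAULT_IDLE_TIMEOUT_MS;
        }
        return unit.toMillis(configured);
    }

    public static void main(String[] args) {
        // No configured value: falls back to the default.
        System.out.println(idleTimeoutMillis(null, TimeUnit.SECONDS));
        // A value given in seconds is converted to milliseconds.
        System.out.println(idleTimeoutMillis(5L, TimeUnit.SECONDS));
    }
}
```

Keeping a single unit at the boundary avoids mixing second and millisecond values further down in the detector.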

Seth Leger August 13, 2012 at 1:23 PM

The fix that I made wasn't effective... it causes infinite waits on certain poller configurations. I'm reverting the change and leaving this bug open in the backlog. I don't think that this case will affect many services in the real world, and the worst case will be that an overloaded service fails to be detected.

Seth Leger August 9, 2012 at 4:57 PM

This issue was caused by the fact that a service was judged to be down if:

  • the server side of the connection went idle (IdleStatus.READER_IDLE)

  • the banner had not been read

However, under heavy load it appears that the server side of the connection can transition to IDLE before sending the banner, so I changed the condition: both sides of the connection must now be idle before the state of the banner is evaluated. Marking as fixed.

commit 9356a008b1e88a958b6aa616c6c3686e7aaffd32
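As a rough sketch of the condition change described in this comment (not the actual detector code), the two rules can be compared as plain boolean checks. The class `IdleBannerCheck`, its method names, and the use of boolean flags in place of the framework's idle-status events are all assumptions made for illustration.

```java
// Hypothetical model of the "service down" decision, not the real detector.
public class IdleBannerCheck {
    // Original rule: reader side idle and no banner read means "down".
    public static boolean judgedDownOriginal(boolean readerIdle, boolean bannerRead) {
        return readerIdle && !bannerRead;
    }

    // Revised rule: the banner is only evaluated once both sides are idle.
    public static boolean judgedDownRevised(boolean readerIdle, boolean writerIdle,
                                            boolean bannerRead) {
        return readerIdle && writerIdle && !bannerRead;
    }

    public static void main(String[] args) {
        // Heavy-load case: reader idle fires before the banner arrives.
        System.out.println(judgedDownOriginal(true, false));       // true (false negative)
        System.out.println(judgedDownRevised(true, false, false)); // false (keeps waiting)
    }
}
```

Under the revised rule, a connection whose writer side never goes idle is never judged, which may relate to the infinite waits reported in the August 13 comment.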

Seth Leger April 19, 2012 at 12:51 PM

Here is the relevant log section. Notice that there is an IDLE message right before the test fails:

Fixed

Details

Created April 19, 2012 at 12:50 PM
Updated January 27, 2017 at 4:19 PM
Resolved August 21, 2012 at 5:27 PM