Minions > v27.0.0 stop processing flows after approx. 5 minutes
Description
Activity

Christian Pape August 11, 2021 at 1:25 PM
Merged.

Will Keaney August 10, 2021 at 9:12 PM
I cherry-picked 4ea8074 onto the release-28.x branch, built it, and deployed to a single location. It's been running there for about 30 minutes, and Netflow-9 flows from that location are still visible in Helm / Grafana.

Will Keaney August 10, 2021 at 5:48 PM
Some notes from my own testing:
Minion 27.0.5
  - both sflow and netflow enabled, with rDNS disabled: netflow stopped within about 2 minutes
  - sflow disabled, netflow enabled with rDNS enabled: netflow stopped within about 5 minutes
  - sflow disabled, netflow enabled with rDNS disabled: netflow stopped within about 4 minutes
Minion 28.0.1
  - both sflow and netflow enabled, with rDNS disabled: netflow stopped within 2 minutes
  - sflow disabled, netflow enabled with rDNS enabled: netflow stopped within 2 minutes
  - sflow disabled, netflow enabled with rDNS disabled: netflow stopped within 4 minutes
In all cases, the Minion continues to log messages about receiving and processing flows. But they're not returned by Helm queries from Grafana.

Christian Pape August 10, 2021 at 12:20 PM (edited)
The PCAP file showed a rather uncommon Option Template with an observation ID scope (system scope) and defined values for absolute timestamps (information elements 152 and 153) and the observation domain name. As far as I understand, this means something like: the system identified by observation domain ID X is named Y during the period between W and Z. In that case, though, I would expect the timestamps to be part of the scope. Something like this isn't really defined in the RFC, since scopes are matched value by value and no intervals are checked at all. This also means that each and every flow that matches the system ID has its two timestamps set to the values given in the option template. We now ensure that values set by the data record are not overwritten by option record data.
Without this safeguard, the flow data is wrong: timestamps are incorrect and intervals are longer than the defined FLOW_ACTIVE_TIMEOUT or FLOW_INACTIVE_TIMEOUT. In the provided PCAP the duration was approximately 50 minutes to 1 hour. This may cause problems in the overall processing of flows.
Please review:
I cannot tell whether this will solve all the observed and reported problems, but it will at least correct false data that could introduce problems in our processing pipeline.
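To illustrate the precedence rule described in this comment, here is a minimal Python sketch. It is not the OpenNMS implementation: the function names, the dictionary-based record layout, the `data_wins` switch, and all sample values (domain ID 7, the name "edge-router", the timestamps) are made up for illustration. The only external fact assumed is that information elements 152 and 153 are named flowStartMilliseconds and flowEndMilliseconds in the IANA IPFIX registry.

```python
# Illustrative sketch only; names and record layout are hypothetical.
FLOW_START_MS = "flowStartMilliseconds"  # information element 152
FLOW_END_MS = "flowEndMilliseconds"      # information element 153


def scope_matches(scope, record):
    """Scopes are matched value by value; no time interval is checked."""
    return all(record.get(key) == value for key, value in scope.items())


def enrich(record, option_scope, option_values, data_wins=True):
    """Return a copy of a data record enriched with option record values."""
    if not scope_matches(option_scope, record):
        return dict(record)
    if data_wins:
        # Option values only fill gaps; fields set by the data record are kept.
        return {**option_values, **record}
    # Problematic behaviour: option values overwrite the flow's own fields.
    return {**record, **option_values}


def duration_ms(record):
    """Flow duration derived from the two absolute timestamps."""
    return record[FLOW_END_MS] - record[FLOW_START_MS]


# Invented sample data: an option record covering a 55-minute period,
# and a flow from the same system that actually lasted 30 seconds.
option_scope = {"observationDomainId": 7}
option_values = {
    "observationDomainName": "edge-router",
    FLOW_START_MS: 1_000_000,
    FLOW_END_MS: 1_000_000 + 55 * 60 * 1000,
}
flow = {
    "observationDomainId": 7,
    FLOW_START_MS: 2_000_000,
    FLOW_END_MS: 2_030_000,
}

overwritten = enrich(flow, option_scope, option_values, data_wins=False)
fixed = enrich(flow, option_scope, option_values, data_wins=True)
```

With option values taking precedence, the flow inherits the 55-minute interval from the option record. With data record values taking precedence, it keeps its own 30-second interval and still picks up the observation domain name.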
Details
Assignee: Christian Pape
Reporter: Dino Yancey
Sprint: None
Priority: High

We narrowed this down to https://github.com/OpenNMS/opennms/commit/7bfbf63a389b7d71e7edca6e1fa20acf40600046 and have verified that a minion built without this commit continues to function as expected in the customer environment.
Only Netflow-9 is affected. SFlow was fine with vanilla 27.0.1 through 28.0.1.