Nephron: optimize aggregation calculation

Description

At the moment the complete raw flow input is processed repeatedly (eight times) to calculate the different aggregations:

  • 1 pass: exporter/interface

  • 3 passes: exporter/interface/(app, conv, host)

  • 1 pass: exporter/interface/tos

  • 3 passes: exporter/interface/tos/(app, conv, host)

This can be reduced to three passes if aggregations are built on top of each other. In the first step the three aggregations

  • exporter/interface/tos/(app, conv, host)

are calculated. All other aggregations can then be derived from them by regrouping.
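The regrouping step can be sketched in plain Java (the key encoding and method names below are illustrative, not Nephron's actual types): a fine-grained aggregate keyed by exporter/interface/tos/app is collapsed into the coarser exporter/interface/app aggregation by dropping the tos component and summing, without rescanning the raw flows.

```java
import java.util.HashMap;
import java.util.Map;

public class Regroup {
    // Hypothetical fine-grained aggregate keyed by "exporter/interface/tos/app".
    // Dropping the tos component and summing the values yields the coarser
    // "exporter/interface/app" aggregation from the finer one.
    static Map<String, Long> dropTos(Map<String, Long> fine) {
        Map<String, Long> coarse = new HashMap<>();
        for (Map.Entry<String, Long> e : fine.entrySet()) {
            String[] parts = e.getKey().split("/");
            // parts = [exporter, interface, tos, app]
            String key = parts[0] + "/" + parts[1] + "/" + parts[3];
            coarse.merge(key, e.getValue(), Long::sum);
        }
        return coarse;
    }

    public static void main(String[] args) {
        Map<String, Long> fine = new HashMap<>();
        fine.put("e1/if1/0/http", 100L);
        fine.put("e1/if1/4/http", 50L);
        fine.put("e1/if1/0/dns", 10L);
        Map<String, Long> coarse = dropTos(fine);
        System.out.println(coarse.get("e1/if1/http")); // 150: tos=0 and tos=4 merged
        System.out.println(coarse.get("e1/if1/dns"));  // 10
    }
}
```

The same dropping-and-summing step, applied to different key components, yields all of the remaining aggregations from the three computed in the first pass.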

Unwanted behaviour of the accumulatingFiredPanes aggregation mode

Using cascading aggregations revealed an unwanted behavior of the currently used "accumulatingFiredPanes" aggregation mode. When late data arrived, the on-time data was summed up twice in the next higher aggregations.

Let o and l be some data that arrived on time and late, respectively. At the lower level two panes are emitted, containing these results:

  • on time pane: o

  • late pane: o + l

The results of these panes are fed as data into the next higher aggregation. This means that o is on time data and x=o+l is late data for the next level, resulting in these panes for the next level:

  • higher level on time pane: o

  • higher level late pane: o + x = o + o + l

The DirectRunner used for testing showed another unwanted behavior that luckily does not exist for the FlinkRunner. That behavior is documented here for completeness only:

In accumulatingFiredPanes aggregation mode, the DirectRunner reprocesses the complete contents of on-time panes when the result for late panes is calculated. For example, if a, b, c are on-time data and l is late data, then the following calculations take place:

  • on time pane: a+b+c

  • late pane: a+b+c+l

Use of discardingFiredPanes aggregation mode instead of accumulatingFiredPanes

At the moment a single flow summary document is persisted in Elasticsearch for each window/key. That document is updated when late data arrives.

If the aggregation mode is changed to discardingFiredPanes then the unwanted behavior described above disappears. However, on-time data and late data are no longer accumulated in Nephron. Additional flow summary documents have to be persisted for late data, and the aggregation of late data with on-time data must be done by Elasticsearch.
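In discarding mode each pane emits only its own delta, so a separate document is persisted per pane and a sum aggregation on the query side recovers the correct total. A minimal sketch of that arithmetic (the list stands in for the persisted documents; not actual Nephron or Elasticsearch code):

```java
import java.util.ArrayList;
import java.util.List;

public class DiscardingPanes {
    public static void main(String[] args) {
        long o = 10, l = 5;

        // Discarding mode: each pane emits only its own delta, so one
        // flow summary document is persisted per pane for the same window/key.
        List<Long> persistedDocs = new ArrayList<>();
        persistedDocs.add(o); // on-time pane document
        persistedDocs.add(l); // late pane document

        // Summing over both documents (done by Elasticsearch at query time)
        // yields the correct total without double counting.
        long total = persistedDocs.stream().mapToLong(Long::longValue).sum();
        System.out.println(total); // 15
    }
}
```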

There are two things to keep in mind:

  1. More flow summary documents get persisted and ElasticSearch has more aggregation work to do.

  2. TopK calculations may be incorrect if keys that did not make it into the topK in the on-time pane receive enough late data in later panes that they should have been part of the overall topK.
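The second point can be made concrete with a small numeric sketch (illustrative only; the topK helper below is not Nephron's implementation):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopKSkew {
    // Return the k keys with the highest volume.
    static List<String> topK(Map<String, Long> volumes, int k) {
        return volumes.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // On-time pane: key "c" misses the top-2.
        Map<String, Long> onTime = Map.of("a", 100L, "b", 90L, "c", 80L);
        System.out.println(topK(onTime, 2)); // [a, b]

        // With late data included, "c" belongs to the overall top-2 ...
        Map<String, Long> total = Map.of("a", 100L, "b", 90L, "c", 120L);
        System.out.println(topK(total, 2)); // [c, a]
        // ... but only [a, b] was persisted for the on-time pane, so "c"'s
        // on-time volume is missing from the persisted topK summaries.
    }
}
```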

The first issue seems not to be relevant for the customer at hand. The second issue seems not to be particularly relevant either:

TopN results (based on topK aggregations) are already approximate:

  1. Elasticsearch's terms aggregation is approximate: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

  2. TopN queries aggregate topK flow summary documents over some time range. If the volume for a key just misses the topK in some windows and makes it in others, the aggregated result is only an approximation of the real volume for that key over the whole time range.

Acceptance / Success Criteria

None

Activity

Stefan Wachter April 15, 2021 at 9:00 AM

Stefan Wachter January 28, 2021 at 4:38 PM

The ES queries do not change. They already aggregate FlowSummary documents over time ranges selected in Grafana. Switching to discarding aggregations just adds some more documents to aggregate.

Jesse White January 28, 2021 at 4:21 PM

Seems sensible to chain the aggregations rather than processing them in parallel - results should be the same with reduced compute requirements.

I'm curious as to what the ES queries would look like if we end up switching from accumulatingFiredPanes to discardingFiredPanes.

Fixed

Created January 22, 2021 at 12:01 PM
Updated April 22, 2021 at 7:54 AM
Resolved April 22, 2021 at 7:54 AM