Use composite aggregation instead of convo_key field from flow documents
Description
Activity
Matthew Brooks April 30, 2019 at 2:26 PM
It looks like there are two other ways to do this: using a script at query time, or using a copy_to field at index time.
Using copy_to doesn't solve the problem of needing to know the format of the convo key at index time.
Using a script looks like it would be much less performant, and since we have many documents it likely will not be adequate.
So based on what I've found, there doesn't seem to be a better solution than what we are currently doing.
Matthew Brooks April 29, 2019 at 9:30 PM (edited)
This post describes a use case similar to ours, where the recommended solution is to use a terms aggregation with partitioning because a composite aggregation cannot do what is needed.
https://discuss.elastic.co/t/composite-aggregation-sorting/156936
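For reference, a minimal sketch of what the partitioned terms aggregation could look like against our data, written as Python that builds the request body. It assumes we keep grouping on netflow.convo_key and summing netflow.bytes; the helper names (partition_query, merge_top_n) and the aggregation name "convos" are made up for illustration, and the time range filter from our real query is left out for brevity. Each term lands in exactly one partition, so the per-partition top N results can be merged client side.

```python
import heapq
import json


def partition_query(partition, num_partitions, size):
    """Build a terms-aggregation request body for one partition of convo keys."""
    return {
        "size": 0,
        "aggs": {
            "convos": {  # hypothetical aggregation name
                "terms": {
                    "field": "netflow.convo_key",
                    # Only terms hashed into this partition are considered.
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                    # Order buckets by the sub-aggregation below.
                    "order": {"total_bytes": "desc"},
                },
                "aggs": {"total_bytes": {"sum": {"field": "netflow.bytes"}}},
            }
        },
    }


def merge_top_n(partition_buckets, n):
    """Merge the bucket lists returned for each partition into a single top N."""
    all_buckets = (b for buckets in partition_buckets for b in buckets)
    return heapq.nlargest(n, all_buckets, key=lambda b: b["total_bytes"]["value"])


if __name__ == "__main__":
    # Print the body that would be sent for partition 0 of 20.
    print(json.dumps(partition_query(0, 20, 5), indent=2))
```

The number of partitions is a tuning choice: more partitions means more requests but less memory pressure per request.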
Matthew Brooks April 29, 2019 at 9:21 PM (edited)
Here is the query I was trying. It selects the buckets as we want, but it only sorts the buckets that it has already selected. So if I query for 5, I get 5 buckets chosen by the natural order of the field values; the total bytes agg is then applied to those 5 buckets and they are sorted by that.
GET _search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "netflow.first_switched": {
              "lte": 1556649962000,
              "format": "epoch_millis"
            }
          }
        },
        {
          "range": {
            "netflow.last_switched": {
              "gte": 1556217962000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "top_n": {
      "composite": {
        "size": 5,
        "sources": [
          { "location": { "terms": { "field": "location" } } },
          { "protocol": { "terms": { "field": "netflow.protocol" } } },
          { "dst_addr": { "terms": { "field": "netflow.dst_addr" } } },
          { "src_addr": { "terms": { "field": "netflow.src_addr" } } },
          { "application": { "terms": { "field": "netflow.application" } } }
        ]
      },
      "aggs": {
        "totalBytesSort": {
          "bucket_sort": {
            "sort": [
              { "total_bytes": { "order": "desc" } }
            ]
          }
        },
        "total_bytes": {
          "sum": { "field": "netflow.bytes" }
        }
      }
    }
  }
}
Matthew Brooks April 29, 2019 at 9:16 PM (edited)
I gave this a try and it appears that this won't work for our use case. The issue is that we cannot use the total bytes sum aggregation the same way. The sum is only applied to the N buckets that were already selected, and those buckets can only be selected in the natural order of the fields we group by, which is not useful.
So as far as I can tell there is no way, using composite aggregation, to ask for the top N "buckets" based on the aggregation yielding the sum of the total bytes. We can only sort on a value that is contained in the bucket.
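To make the difference concrete, here is a small pure-Python sketch (no Elasticsearch involved) that mimics the two behaviours on made-up flow data. The function names and sample flows are invented for illustration; the only thing taken from the behaviour described above is that composite buckets come back in the natural order of their keys, and the sort by total bytes happens afterwards.

```python
from collections import defaultdict


def bucket_totals(flows):
    """Sum bytes per conversation key (a tuple of the grouping fields)."""
    totals = defaultdict(int)
    for key, nbytes in flows:
        totals[key] += nbytes
    return dict(totals)


def composite_then_sort(flows, n):
    """Take the first n buckets in natural key order, then sort those by bytes."""
    totals = bucket_totals(flows)
    first_page = sorted(totals)[:n]
    return sorted(first_page, key=lambda k: totals[k], reverse=True)


def true_top_n(flows, n):
    """What we actually want: the n buckets with the largest byte totals."""
    totals = bucket_totals(flows)
    return sorted(totals, key=lambda k: totals[k], reverse=True)[:n]


# Made-up sample data: (location, protocol, dst, src, application) -> bytes
flows = [
    (("A", 6, "10.0.0.1", "10.0.0.2", "http"), 100),
    (("A", 6, "10.0.0.1", "10.0.0.2", "http"), 50),
    (("B", 17, "10.0.0.3", "10.0.0.4", "domain"), 10),
    (("Z", 6, "10.0.0.5", "10.0.0.6", "ssh"), 9000),
]

if __name__ == "__main__":
    print(composite_then_sort(flows, 2))  # never sees the "Z" conversation
    print(true_top_n(flows, 2))           # starts with the "Z" conversation
```

The largest conversation is missed entirely by the first approach because its key sorts last.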
Details
Assignee: Matthew Brooks
Reporter: Jesse White
Priority: Major

The 'convo_key' field is currently used for grouping flow documents that form a single "conversation". We should remove this field and compute the grouping dynamically using a [composite aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html] instead.
Example of field value:
netflow.convo_key
["WAGON",17,"1.1.1.1","10.145.145.114","domain"]
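As a rough illustration of the format, the value above can be read as a compact JSON array with five elements. The sketch below parses and re-serializes it in Python. The element names are assumptions: the ticket does not say which address is the source and which is the destination, so they are called addr_a and addr_b here, and the ConvoKey type and helper functions are hypothetical.

```python
import json
from typing import NamedTuple


class ConvoKey(NamedTuple):
    # Element names are assumptions; the ticket only shows the raw value.
    location: str
    protocol: int
    addr_a: str  # source vs. destination is not stated in the ticket
    addr_b: str
    application: str


def parse_convo_key(raw: str) -> ConvoKey:
    """Parse a convo_key string, assuming it is a 5-element JSON array."""
    parts = json.loads(raw)
    if not isinstance(parts, list) or len(parts) != 5:
        raise ValueError(f"unexpected convo_key format: {raw!r}")
    return ConvoKey(*parts)


def format_convo_key(key: ConvoKey) -> str:
    """Serialize back to the compact form shown in the example value."""
    return json.dumps(list(key), separators=(",", ":"))


if __name__ == "__main__":
    raw = '["WAGON",17,"1.1.1.1","10.145.145.114","domain"]'
    key = parse_convo_key(raw)
    print(key.location, key.protocol, key.application)
    print(format_convo_key(key) == raw)
```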