Use composite aggregation instead of convo_key field from flow documents

Description

The 'convo_key' field is currently used to group flow documents that form a single "conversation". We should remove this field and compute the grouping dynamically using a [composite aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html] instead.

 

Example of field value:

netflow.convo_key: ["WAGON",17,"1.1.1.1","10.145.145.114","domain"]

Acceptance / Success Criteria

None

Activity

Matthew Brooks April 30, 2019 at 2:26 PM

It looks like there are two other ways to do this: using a script at query time or using a copy_to field at index time.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation

Using copy_to doesn't solve the problem of having to know the format of a convo key at index time.

Using a script looks like it will be much less performant, and since we have many documents it likely will not be adequate.

 

So based on what I've found it seems that there isn't a better solution compared to how we are currently doing it.
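
For reference, a minimal sketch of what the script-at-query-time variant could look like, written as a Python function that builds the request body. The field names come from the composite query elsewhere in this ticket; the Painless expression and the "|" separator are assumptions for illustration, and the script presumes every grouping field has a value in each document.

```python
import json

# Grouping fields as used in the composite query in this ticket.
GROUP_FIELDS = [
    "location",
    "netflow.protocol",
    "netflow.dst_addr",
    "netflow.src_addr",
    "netflow.application",
]


def scripted_terms_body(top_n: int) -> dict:
    """Build a search body that groups flows by a key computed at query time."""
    # Assumed Painless expression: concatenate the fields into one synthetic key.
    source = " + '|' + ".join(f"doc['{field}'].value" for field in GROUP_FIELDS)
    return {
        "size": 0,
        "aggs": {
            "top_n": {
                "terms": {
                    "script": {"source": source, "lang": "painless"},
                    "size": top_n,
                    # A terms aggregation can order buckets by a sub-aggregation.
                    "order": {"total_bytes": "desc"},
                },
                "aggs": {"total_bytes": {"sum": {"field": "netflow.bytes"}}},
            }
        },
    }


print(json.dumps(scripted_terms_body(5), indent=2))
```

The script runs once per matching document, which is where the performance concern above comes from.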

 

Matthew Brooks April 29, 2019 at 9:30 PM
Edited

This post describes a use case similar to ours where the recommended solution is to use terms aggregation with partitioning because a composite aggregation cannot do what is needed.

https://discuss.elastic.co/t/composite-aggregation-sorting/156936
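
A rough sketch of how terms aggregation with partitioning might be applied to our existing field, assuming we keep netflow.convo_key and order by the total bytes sum. The helper names, the partition count, and the client-side merge are illustrative assumptions, not something taken from the linked post.

```python
import heapq


def partitioned_terms_body(partition: int, num_partitions: int, top_n: int) -> dict:
    """Build the search body for one partition of the convo_key terms."""
    return {
        "size": 0,
        "aggs": {
            "top_n": {
                "terms": {
                    "field": "netflow.convo_key",
                    # Each term hashes into exactly one partition.
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": top_n,
                    "order": {"total_bytes": "desc"},
                },
                "aggs": {"total_bytes": {"sum": {"field": "netflow.bytes"}}},
            }
        },
    }


def merge_top_n(per_partition_buckets, top_n: int) -> list:
    """Merge the bucket lists returned for each partition into a global top N."""
    all_buckets = (bucket for buckets in per_partition_buckets for bucket in buckets)
    return heapq.nlargest(top_n, all_buckets, key=lambda b: b["total_bytes"]["value"])


# Hypothetical bucket lists, shaped like the "buckets" array of a terms response.
partition_0 = [{"key": "k1", "total_bytes": {"value": 900.0}}]
partition_1 = [{"key": "k2", "total_bytes": {"value": 1500.0}},
               {"key": "k3", "total_bytes": {"value": 20.0}}]
print([b["key"] for b in merge_top_n([partition_0, partition_1], 2)])
```

Because a term lands in only one partition, the top N of each partition can be merged into an overall top N, at the cost of one request per partition.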

Matthew Brooks April 29, 2019 at 9:21 PM
Edited

Here is the query I was trying. It selects the buckets as we want, but it only sorts the buckets it has already selected. So if I query for 5, I get 5 buckets chosen by the natural order of the field values; the total bytes agg is then applied to those 5 buckets and they are sorted by that.

GET _search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "netflow.first_switched": { "lte": 1556649962000, "format": "epoch_millis" } } },
        { "range": { "netflow.last_switched": { "gte": 1556217962000, "format": "epoch_millis" } } }
      ]
    }
  },
  "aggs": {
    "top_n": {
      "composite": {
        "size": 5,
        "sources": [
          { "location": { "terms": { "field": "location" } } },
          { "protocol": { "terms": { "field": "netflow.protocol" } } },
          { "dst_addr": { "terms": { "field": "netflow.dst_addr" } } },
          { "src_addr": { "terms": { "field": "netflow.src_addr" } } },
          { "application": { "terms": { "field": "netflow.application" } } }
        ]
      },
      "aggs": {
        "totalBytesSort": { "bucket_sort": { "sort": [{ "total_bytes": { "order": "desc" } }] } },
        "total_bytes": { "sum": { "field": "netflow.bytes" } }
      }
    }
  }
}

Matthew Brooks April 29, 2019 at 9:16 PM
Edited

I gave this a try and it appears that this won't work for our use case. The issue is that we cannot use the total bytes sum aggregation the same way. The sum can only be applied to the N buckets that were already selected, and those buckets can only be selected by the natural order of the fields we group by. That is not useful.

So as far as I can tell there is no way, using composite aggregation, to ask for the top N "buckets" based on the aggregation yielding the sum of the total bytes. We can only sort on a value that is contained in the bucket.
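
To illustrate the difference, here is a small pure-Python simulation (no Elasticsearch involved) of the behavior described above. The keys and byte counts are made up, and the two functions only model the ordering semantics as I understand them.

```python
from collections import defaultdict

# Made-up (conversation key, bytes) pairs standing in for flow documents.
flows = [("a", 10), ("b", 30), ("c", 20), ("y", 900), ("z", 500), ("a", 5)]


def totals_by_key(docs):
    """Sum bytes per conversation key."""
    totals = defaultdict(int)
    for key, nbytes in docs:
        totals[key] += nbytes
    return dict(totals)


def composite_then_sort(docs, n):
    """Pick the first n keys in natural order, then sort only those by bytes."""
    totals = totals_by_key(docs)
    first_page = sorted(totals)[:n]
    return sorted(first_page, key=lambda k: totals[k], reverse=True)


def true_top_n(docs, n):
    """Rank every key by total bytes and keep the n largest."""
    totals = totals_by_key(docs)
    return sorted(totals, key=lambda k: totals[k], reverse=True)[:n]


print(composite_then_sort(flows, 2))  # ['b', 'a']
print(true_top_n(flows, 2))           # ['y', 'z']
```

The first result is what the composite query gives us; the second is what we actually need.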

Won't Fix

Details

Created April 23, 2019 at 3:15 PM
Updated June 3, 2019 at 8:05 AM
Resolved April 30, 2019 at 2:28 PM
