When working with statistical aggregations in Elasticsearch 1.7, I couldn't find any documentation about how arrays are treated.
Of course, you need a numeric field for statistical aggregations.
In my particular case I needed arrays of objects, but this should obviously not make a difference.
For statistical aggregations to work, you need to “seal” your mapping by setting dynamic to false.
Otherwise there might be NumberFormatExceptions.
Anyway, that being said, let's dive into a practical example.
The question:
How are arrays of numeric values (in this case nested inside objects) treated when it comes to statistical aggregations?
I didn’t know the answer and I couldn’t find any documentation, so I tested it:
Here’s the mapping:
POST /test
POST /test/obj/_mapping
{
  "obj": {
    "dynamic": false,
    "properties": {
      "conf": {
        "type": "object",
        "properties": {
          "val": {
            "type": "float"
          }
        }
      }
    }
  }
}
The data:
POST /test/obj/1
{
  "conf": {
    "val": [1.011, 0.1237012, 2.8988, 1, 0]
  }
}
POST /test/obj/2
{
  "conf": {
    "val": [0.123983, 1323.2381, 24.130821, 0.8230]
  }
}
POST /test/obj/3
{
  "conf": {
    "val": [0, 1, 2, 3]
  }
}
And the stats-aggregation:
This will treat each element in the array individually:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "field": "conf.val"
      }
    }
  }
}
The total count is not the number of documents, but the sum of the lengths of the arrays across all documents (5 + 4 + 4 = 13).
Result:
"aggregations": {
"confidence": {
"count": 13,
"min": 0,
"max": 1323.2381591796875,
"avg": 104.56534342754345,
"sum": 1359.349464558065
}
}
If you need a per-document aggregation of the array values, such as SUM or MULTIPLY, you can use a script:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "script": "score = 0; for (el in doc['conf.val']) {score += el}; return score;"
      }
    }
  }
}
Result:
Since we explicitly pre-aggregate the values per document, the count is now equal to the number of documents matched by the query (here the default match_all, i.e. the index size).
"aggregations": {
"confidence": {
"count": 3,
"min": 5.033501133322716,
"max": 1348.3159634247422,
"avg": 453.1164881860216,
"sum": 1359.349464558065
}
}
To aggregate over max-values:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "script": "score = 0; for (el in doc['conf.val']) {if (el > score) {score = el}}; return score;"
      }
    }
  }
}
Result:
"aggregations": {
"confidence": {
"count": 3,
"min": 2.8987998962402344,
"max": 1323.2381591796875,
"avg": 443.04565302530926,
"sum": 1329.1369590759277
}
}
All of this also works with the even more useful extended_stats-aggregation:
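The corresponding request simply combines the MAX-value script from above with extended_stats in place of stats:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "extended_stats": {
        "script": "score = 0; for (el in doc['conf.val']) {if (el > score) {score = el}}; return score;"
      }
    }
  }
}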
Result from the MAX-value aggregation script:
"aggregations": {
"confidence": {
"count": 3,
"min": 2.8987998962402344,
"max": 1323.2381591796875,
"avg": 443.04565302530926,
"sum": 1329.1369590759277,
"sum_of_squares": 1750976.6289500864,
"variance": 387369.42565207276,
"std_deviation": 622.3900912225971,
"std_deviation_bounds": {
"upper": 1687.8258354705035,
"lower": -801.7345294198849
}
}
}
MIN-values with extended stats aggregation:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "extended_stats": {
        "script": "score = -1; for (el in doc['conf.val']) {if (score == -1) {score = el} else if (el < score) {score = el}}; return score;"
      }
    }
  }
}
Result:
"aggregations": {
"confidence": {
"count": 3,
"min": 0,
"max": 0.1239830031991005,
"avg": 0.0413276677330335,
"sum": 0.1239830031991005,
"sum_of_squares": 0.015371785082268163,
"variance": 0.0034159522405040367,
"std_deviation": 0.05844614820930492,
"std_deviation_bounds": {
"upper": 0.15821996415164333,
"lower": -0.07556462868557634
}
}
Lessons learned:
- Array elements are treated as individual “documents” when it comes to aggregations.
- This behaviour might not be useful for all use cases.
- How numerical arrays are treated in aggregations can be overridden and customized with scripts.