
Statistical aggregations on numeric object array fields

When working with statistical aggregations in Elasticsearch 1.7, I couldn't find any documentation on how array fields are treated.

Of course you need a numeric field for statistical aggregations. In my special case the numeric arrays were nested inside objects, but this should not make a difference.

For statistical aggregations to work, you need to “seal” your mapping by setting dynamic to false.

Otherwise there might be NumberFormatExceptions.

That being said, let's dive into a practical example.

The question:

How are arrays of numeric values (here nested inside objects) treated when it comes to statistical aggregations?

I didn’t know the answer and I couldn’t find any documentation, so I tested it:

Here’s the mapping:

POST /test
POST /test/obj/_mapping
{
    "obj" : {
        "dynamic" : false,
        "properties" : {
            "conf" : {
                "type": "object",
                "properties": {
                    "val" : {
                        "type" : "float"
                    }
                }
            }
        }
    }
}

The data:

POST /test/obj/1
{
    "conf" : {
        "val" : [1.011, 0.1237012, 2.8988, 1, 0]
    }
}

POST /test/obj/2
{
    "conf" : {
        "val" : [0.123983, 1323.2381, 24.130821, 0.8230]
    }
}

POST /test/obj/3
{
    "conf" : {
        "val" : [0,1,2,3]
    }
}

And the stats-aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-aggregations-metrics-stats-aggregation.html

This will treat each element in the array individually:

GET test/obj/_search
{
    "size": 0, 
    "aggs" : {
        "confidence" : {
            "stats" : {
                "field" : "conf.val"
            }
        }
    }
}

The total count is not equal to the number of documents, but to the sum of the array lengths across all documents (here 5 + 4 + 4 = 13).

Result:

"aggregations": {
      "confidence": {
         "count": 13,
         "min": 0,
         "max": 1323.2381591796875,
         "avg": 104.56534342754345,
         "sum": 1359.349464558065
      }
}
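As a sanity check, the same numbers can be reproduced locally. Here is a minimal Python sketch (the values are copied from the example documents above; the last digits differ slightly from the Elasticsearch result because ES stores the field as 32-bit floats):

```python
# Recompute the stats aggregation by hand: Elasticsearch treats
# every array element as an individual value.
docs = [
    [1.011, 0.1237012, 2.8988, 1, 0],
    [0.123983, 1323.2381, 24.130821, 0.8230],
    [0, 1, 2, 3],
]

values = [v for doc in docs for v in doc]  # flatten all arrays

stats = {
    "count": len(values),            # 13 = 5 + 4 + 4, not 3
    "min": min(values),
    "max": max(values),
    "sum": sum(values),
    "avg": sum(values) / len(values),
}
```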

If you need some sort of per-document aggregation, like SUM or PRODUCT, you can use a script:

GET test/obj/_search
{
    "size": 0, 
    "aggs" : {
        "confidence" : {
            "stats" : {
                "script" : "score = 0; for (el in doc['conf.val']) {score += el}; return score;"
            }
        }
    }
}

Result:

Since we explicitly pre-aggregate the values, the count is now equal to the number of documents matched by the query (with the default match_all, the whole index).

"aggregations": {
      "confidence": {
         "count": 3,
         "min": 5.033501133322716,
         "max": 1348.3159634247422,
         "avg": 453.1164881860216,
         "sum": 1359.349464558065
      }
}
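The scripted pre-aggregation has a straightforward local equivalent. A sketch in Python (data copied from the example documents above):

```python
docs = [
    [1.011, 0.1237012, 2.8988, 1, 0],
    [0.123983, 1323.2381, 24.130821, 0.8230],
    [0, 1, 2, 3],
]

# The script collapses each array into one value per document first,
# so the stats run over three sums instead of thirteen elements.
per_doc_sums = [sum(doc) for doc in docs]

count = len(per_doc_sums)  # 3 == number of documents
total = sum(per_doc_sums)  # overall sum is unchanged
```

Note that min and max now refer to the smallest and largest per-document sum, while the overall sum stays the same as in the unscripted aggregation.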

To aggregate over max-values:

GET test/obj/_search
{
    "size": 0, 
    "aggs" : {
        "confidence" : {
            "stats" : {
                "script" : "score = 0; for (el in doc['conf.val']) {if (el > score) {score = el}}; return score;"
            }
        }
    }
}

Result:
"aggregations": {
      "confidence": {
         "count": 3,
         "min": 2.8987998962402344,
         "max": 1323.2381591796875,
         "avg": 443.04565302530926,
         "sum": 1329.1369590759277
      }
}

All of this also works with the even more useful extended_stats aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-aggregations-metrics-extendedstats-aggregation.html

Result from the MAX-value aggregation script:

"aggregations": {
      "confidence": {
         "count": 3,
         "min": 2.8987998962402344,
         "max": 1323.2381591796875,
         "avg": 443.04565302530926,
         "sum": 1329.1369590759277,
         "sum_of_squares": 1750976.6289500864,
         "variance": 387369.42565207276,
         "std_deviation": 622.3900912225971,
         "std_deviation_bounds": {
            "upper": 1687.8258354705035,
            "lower": -801.7345294198849
         }
      }
}
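The extra fields of extended_stats can be reproduced the same way. A hedged Python sketch over the per-document maxima (data copied from above; variance is the population variance, and std_deviation_bounds default to avg ± 2 standard deviations):

```python
import math

docs = [
    [1.011, 0.1237012, 2.8988, 1, 0],
    [0.123983, 1323.2381, 24.130821, 0.8230],
    [0, 1, 2, 3],
]

maxima = [max(doc) for doc in docs]  # one maximum per document

count = len(maxima)
avg = sum(maxima) / count
sum_of_squares = sum(v * v for v in maxima)
variance = sum_of_squares / count - avg * avg  # population variance
std_deviation = math.sqrt(variance)
std_deviation_bounds = {
    "upper": avg + 2 * std_deviation,  # default sigma is 2
    "lower": avg - 2 * std_deviation,
}
```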

MIN-values with extended stats aggregation:

GET test/obj/_search
{
   "size": 0,
   "aggs": {
      "confidence": {
         "extended_stats": {
            "script": "score = -1; for (el in doc['conf.val']) {if (score == -1) {score = el}; else if (el < score) {score = el}}; return score;"
         }
      }
   }
}

Result:

"aggregations": {
      "confidence": {
         "count": 3,
         "min": 0,
         "max": 0.1239830031991005,
         "avg": 0.0413276677330335,
         "sum": 0.1239830031991005,
         "sum_of_squares": 0.015371785082268163,
         "variance": 0.0034159522405040367,
         "std_deviation": 0.05844614820930492,
         "std_deviation_bounds": {
            "upper": 0.15821996415164333,
            "lower": -0.07556462868557634
         }
      }
}
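In Python terms, the min-per-document script reduces to a one-liner (data copied from the example documents above). Note that the -1 sentinel in the script only works because all values here are non-negative:

```python
docs = [
    [1.011, 0.1237012, 2.8988, 1, 0],
    [0.123983, 1323.2381, 24.130821, 0.8230],
    [0, 1, 2, 3],
]

# One minimum per document; the stats then run over these three values.
minima = [min(doc) for doc in docs]
```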

Lessons learned:

  • Array elements are treated as individual "documents" (values) when it comes to aggregations.
  • This behaviour might not be useful for all use cases.
  • How numeric arrays are treated in aggregations can be overridden and customized with scripts.
