When working with statistical aggregations in Elasticsearch 1.7, I couldn't find any documentation about how arrays are treated.
Of course, you need a numeric field for statistical aggregations.
In my particular case I needed arrays of objects, but this should obviously not make a difference.
For statistical aggregations to work, you need to “seal” your mapping by setting dynamic to false.
Otherwise there might be NumberFormatExceptions.
Anyway, that being said, let's dive into a practical example.
The question:
How are arrays of numeric values (in this case nested inside objects) treated when it comes to statistical aggregations?
I didn’t know the answer and I couldn’t find any documentation, so I tested it:
Here’s the mapping:
POST /test
POST /test/obj/_mapping
{
  "obj": {
    "dynamic": false,
    "properties": {
      "conf": {
        "type": "object",
        "properties": {
          "val": {
            "type": "float"
          }
        }
      }
    }
  }
}
The data:
POST /test/obj/1
{
  "conf": {
    "val": [1.011, 0.1237012, 2.8988, 1, 0]
  }
}
POST /test/obj/2
{
  "conf": {
    "val": [0.123983, 1323.2381, 24.130821, 0.8230]
  }
}
POST /test/obj/3
{
  "conf": {
    "val": [0, 1, 2, 3]
  }
}
And the stats-aggregation:
This will treat each element in the array individually:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "field": "conf.val"
      }
    }
  }
}
The total count is not the number of documents, but the sum of the lengths of the arrays across all documents (5 + 4 + 4 = 13).
Result:
"aggregations": {
"confidence": {
"count": 13,
"min": 0,
"max": 1323.2381591796875,
"avg": 104.56534342754345,
"sum": 1359.349464558065
}
}
If you need a per-document aggregation of the array values, such as SUM or MULTIPLY, you can use a script:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "script": "score = 0; for (el in doc['conf.val']) {score += el}; return score;"
      }
    }
  }
}
Result:
Since we explicitly pre-aggregate the values per document, the count is now equal to the number of documents matched by the query (here the default match_all, i.e. the index size).
"aggregations": {
"confidence": {
"count": 3,
"min": 5.033501133322716,
"max": 1348.3159634247422,
"avg": 453.1164881860216,
"sum": 1359.349464558065
}
}
To aggregate over max-values:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "stats": {
        "script": "score = 0; for (el in doc['conf.val']) {if (el > score) {score = el}}; return score;"
      }
    }
  }
}
Result:
"aggregations": {
"confidence": {
"count": 3,
"min": 2.8987998962402344,
"max": 1323.2381591796875,
"avg": 443.04565302530926,
"sum": 1329.1369590759277
}
}
All of this also works with the even more useful extended_stats-aggregation:
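The corresponding request simply combines the MAX-value script from above with extended_stats in place of stats:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "extended_stats": {
        "script": "score = 0; for (el in doc['conf.val']) {if (el > score) {score = el}}; return score;"
      }
    }
  }
}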
Result from the MAX-value aggregation script:
"aggregations": {
"confidence": {
"count": 3,
"min": 2.8987998962402344,
"max": 1323.2381591796875,
"avg": 443.04565302530926,
"sum": 1329.1369590759277,
"sum_of_squares": 1750976.6289500864,
"variance": 387369.42565207276,
"std_deviation": 622.3900912225971,
"std_deviation_bounds": {
"upper": 1687.8258354705035,
"lower": -801.7345294198849
}
}
}
MIN-values with extended stats aggregation:
GET test/obj/_search
{
  "size": 0,
  "aggs": {
    "confidence": {
      "extended_stats": {
        "script": "score = -1; for (el in doc['conf.val']) {if (score == -1) {score = el} else if (el < score) {score = el}}; return score;"
      }
    }
  }
}
Result:
"aggregations": {
"confidence": {
"count": 3,
"min": 0,
"max": 0.1239830031991005,
"avg": 0.0413276677330335,
"sum": 0.1239830031991005,
"sum_of_squares": 0.015371785082268163,
"variance": 0.0034159522405040367,
"std_deviation": 0.05844614820930492,
"std_deviation_bounds": {
"upper": 0.15821996415164333,
"lower": -0.07556462868557634
}
}
Lessons learned:
- Array elements are treated as individual “documents” when it comes to aggregations.
- This behaviour might not be useful for all use cases.
- How numerical arrays are treated in aggregations can be overridden and customized with scripts.