Accuracy of Elasticsearch aggregations: the shard_size setting
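A distributed statistics algorithm is judged on three measures: data volume, real-time performance, and accuracy. An algorithm can satisfy only two of the three at once, and Elasticsearch gives up some aggregation accuracy in exchange for real-time results. Because an index's data is spread across shards, the coordinating node never sees the whole data set: each shard returns only its own top terms, and the coordinating node merges those partial lists. For this reason the terms aggregation response includes doc_count_error_upper_bound, an upper bound on the number of documents that may have been missed for a term group. The smaller this value, the more accurate the result; 0 means the result is exact. There are two ways to get exact results: (1) when the data volume is small, use a single shard; (2) increase the shard_size parameter until the reported error is 0, which means the aggregation result is exact.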
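To make the merge step concrete, here is a minimal Python sketch of a simplified model of the terms aggregation. It is not Elasticsearch code: the shard contents, the term names A to D, and the helper names are made up for illustration. Each shard returns only its top shard_size terms, and the error bound is taken as the sum of the last returned count from every shard that had to cut terms off.

```python
from collections import Counter


def shard_top_terms(shard_counts, shard_size):
    """Return one shard's top `shard_size` terms and its worst-case error."""
    top = Counter(shard_counts).most_common(shard_size)
    # A term the shard cut off can have at most as many docs as the last
    # term it did return. If nothing was cut off, the error is 0.
    truncated = len(shard_counts) > shard_size
    error = top[-1][1] if truncated and top else 0
    return dict(top), error


def terms_agg(shards, size, shard_size):
    """Merge per-shard partial results the way a coordinating node would."""
    merged = Counter()
    error_bound = 0
    for shard_counts in shards:
        top, error = shard_top_terms(shard_counts, shard_size)
        merged.update(top)      # add this shard's partial counts
        error_bound += error    # sum of per-shard worst cases
    return merged.most_common(size), error_bound


# Hypothetical per-shard document counts for four terms.
# True totals: A=21, B=23, C=22, D=19, so the true top 2 is B, C.
shards = [
    {"A": 10, "B": 8, "C": 7, "D": 1},
    {"A": 9, "D": 8, "C": 7, "B": 6},
    {"D": 10, "B": 9, "C": 8, "A": 2},
]

print(terms_agg(shards, size=2, shard_size=2))  # ([('A', 19), ('D', 18)], 25)
print(terms_agg(shards, size=2, shard_size=4))  # ([('B', 23), ('C', 22)], 0)
```

With shard_size=2 the merged top 2 is wrong in both terms and counts, and the error bound of 25 signals that. With shard_size=4 every shard returns all of its terms, the bound drops to 0, and the result is exact.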
Initialize the data
DELETE my_flights
PUT my_flights
{
"settings": {
"number_of_shards": 20
},
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
"type" : "keyword"
},
"DestCountry" : {
"type" : "keyword"
},
"DestLocation" : {
"type" : "geo_point"
},
"DestRegion" : {
"type" : "keyword"
},
"DestWeather" : {
"type" : "keyword"
},
"DistanceKilometers" : {
"type" : "float"
},
"DistanceMiles" : {
"type" : "float"
},
"FlightDelay" : {
"type" : "boolean"
},
"FlightDelayMin" : {
"type" : "integer"
},
"FlightDelayType" : {
"type" : "keyword"
},
"FlightNum" : {
"type" : "keyword"
},
"FlightTimeHour" : {
"type" : "keyword"
},
"FlightTimeMin" : {
"type" : "float"
},
"Origin" : {
"type" : "keyword"
},
"OriginAirportID" : {
"type" : "keyword"
},
"OriginCityName" : {
"type" : "keyword"
},
"OriginCountry" : {
"type" : "keyword"
},
"OriginLocation" : {
"type" : "geo_point"
},
"OriginRegion" : {
"type" : "keyword"
},
"OriginWeather" : {
"type" : "keyword"
},
"dayOfWeek" : {
"type" : "integer"
},
"timestamp" : {
"type" : "date"
}
}
}
}
Reindex the sample data into the new index
POST _reindex
{
"source":{
"index":"kibana_sample_data_flights"
},
"dest":{
"index":"my_flights"
}
}
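Count the documents in the source index
GET kibana_sample_data_flights/_count
Count the documents in the new index
GET my_flights/_count
Run a terms aggregation on the source index with show_term_doc_count_error enabled
GET kibana_sample_data_flights/_search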
{
"size":0,
"aggs":{
"weather":{
"terms": {
"field": "OriginWeather",
"size": 5,
"show_term_doc_count_error":true
}
}
}
}
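The fields to look at in the response are doc_count_error_upper_bound (on the aggregation and, with show_term_doc_count_error enabled, on each bucket) and sum_other_doc_count. The Python sketch below shows one way to read them. The helper name check_terms_accuracy is hypothetical, and the sample response uses made-up numbers that only mirror the response shape; it is not real output from the flights index.

```python
def check_terms_accuracy(response, agg_name):
    """Summarize the accuracy fields of one terms aggregation response."""
    agg = response["aggregations"][agg_name]
    worst_case = agg["doc_count_error_upper_bound"]
    return {
        # 0 means no term can have been missed or under-counted.
        "exact": worst_case == 0,
        "worst_case_error": worst_case,
        # Documents that belong to terms outside the returned buckets.
        "other_docs": agg["sum_other_doc_count"],
        # Per-bucket bound, present when show_term_doc_count_error is true.
        "per_bucket_error": {
            b["key"]: b.get("doc_count_error_upper_bound", 0)
            for b in agg["buckets"]
        },
    }


# Illustrative response only: the counts are invented for this example.
sample_response = {
    "aggregations": {
        "weather": {
            "doc_count_error_upper_bound": 12,
            "sum_other_doc_count": 300,
            "buckets": [
                {"key": "Clear", "doc_count": 200, "doc_count_error_upper_bound": 0},
                {"key": "Rain", "doc_count": 150, "doc_count_error_upper_bound": 12},
            ],
        }
    }
}

summary = check_terms_accuracy(sample_response, "weather")
print(summary["exact"], summary["worst_case_error"])
```

A non-zero bound on a bucket means that bucket's doc_count may be too low by up to that many documents.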
Increase shard_size until doc_count_error_upper_bound is 0 to fix the inaccurate aggregation
GET my_flights/_search
{
"size":0,
"aggs":{
"weather":{
"terms": {
"field": "OriginWeather",
"size": 5,
"shard_size": 10,
"show_term_doc_count_error":true
}
}
}
}
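The "increase until the error is 0" step can also be automated. The Python sketch below is one possible approach, not an official recipe: find_exact_shard_size and the aggregation name by_term are hypothetical, and search stands for any callable that sends the request body to my_flights/_search and returns the parsed JSON. A fake_search stand-in is used here so the example runs without a cluster.

```python
def find_exact_shard_size(search, field, size, max_shard_size=10_000):
    """Double shard_size until the reported error bound is 0.

    `search` is a hypothetical callable: it takes a request body (dict)
    and returns the parsed JSON response of a _search request.
    Returns the first shard_size with an error bound of 0, or None.
    """
    shard_size = size  # start at size; a smaller shard_size makes no sense
    while shard_size <= max_shard_size:
        body = {
            "size": 0,
            "aggs": {
                "by_term": {
                    "terms": {
                        "field": field,
                        "size": size,
                        "shard_size": shard_size,
                        "show_term_doc_count_error": True,
                    }
                }
            },
        }
        agg = search(body)["aggregations"]["by_term"]
        if agg["doc_count_error_upper_bound"] == 0:
            return shard_size
        shard_size *= 2
    return None


def fake_search(body):
    # Stand-in for a real cluster, for illustration only: pretends the
    # result becomes exact once shard_size reaches 12.
    shard_size = body["aggs"]["by_term"]["terms"]["shard_size"]
    error = 0 if shard_size >= 12 else 12 - shard_size
    return {"aggregations": {"by_term": {"doc_count_error_upper_bound": error}}}


print(find_exact_shard_size(fake_search, "OriginWeather", size=5))
```

A larger shard_size makes every shard return more terms, so it costs more memory and network traffic; the max_shard_size cap is there to stop the search from growing without limit. When shard_size is omitted, the Elasticsearch documentation gives the default as size * 1.5 + 10.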