Accuracy of Elasticsearch Aggregations: the shard_size Setting

小灰灰 2022-10-29 01:47

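Three metrics are used to judge a distributed statistics algorithm: data volume, real-time performance, and accuracy. Any algorithm can satisfy only two of the three, and ES gives up some aggregation accuracy in exchange for real-time results. Because an index's data is spread across shards, the coordinating node never sees the full data set: each shard returns only its own top terms, and the coordinating node merges those partial lists. ES therefore reports, in doc_count_error_upper_bound, an upper bound on the number of documents that may have been missed for the term buckets. The smaller this value, the more accurate the result; 0 means the result is exact. There are two ways to get exact statistics: 1. when the data volume is small, use a single shard; 2. increase the shard_size parameter until the missed-document count for the term buckets is 0, which means the aggregation result is exact.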
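The effect described above can be reproduced without a cluster. The following is a minimal Python sketch of a simplified model, not Elasticsearch's actual implementation: each "shard" is just a list of term values, and the names `shard_top_terms`, `terms_agg`, and `SHARDS` are made up for this illustration. Each shard returns its `shard_size` most frequent terms, and the merge step sums them. For every shard that had to drop terms, the count of its last returned term is added to the error bound, since no omitted term can have more documents than that on the shard.

```python
from collections import Counter


def shard_top_terms(shard_docs, shard_size):
    """Return one shard's top terms and its contribution to the error bound."""
    counts = Counter(shard_docs)
    top = counts.most_common(shard_size)
    # If the shard dropped some terms, any dropped term has at most as many
    # documents as the last term it did return. (Simplified model.)
    truncated = len(counts) > shard_size
    error = top[-1][1] if (truncated and top) else 0
    return dict(top), error


def terms_agg(shards, size, shard_size):
    """Merge per-shard top terms the way a coordinating node would."""
    merged = Counter()
    error_upper_bound = 0
    for docs in shards:
        top, error = shard_top_terms(docs, shard_size)
        merged.update(top)  # adds the per-shard counts together
        error_upper_bound += error
    return merged.most_common(size), error_upper_bound


# Two hypothetical shards. "Rain" is the true winner overall (4 documents),
# but it is only the second most frequent term on each individual shard.
SHARDS = [
    ["Sunny"] * 3 + ["Rain"] * 2,
    ["Cloudy"] * 3 + ["Rain"] * 2,
]

if __name__ == "__main__":
    print(terms_agg(SHARDS, size=1, shard_size=1))  # "Rain" is missed
    print(terms_agg(SHARDS, size=1, shard_size=2))  # exact result
```

With `shard_size=1` the merged result never sees "Rain" and the error bound is 6; with `shard_size=2` every shard returns all of its terms, the bound drops to 0, and "Rain" wins with 4 documents.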

# Initialize the data: recreate my_flights with 20 primary shards
# (DELETE returns 404 on the first run if the index does not exist yet)
DELETE my_flights

PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings": {
    "properties": {
      "AvgTicketPrice": { "type": "float" },
      "Cancelled": { "type": "boolean" },
      "Carrier": { "type": "keyword" },
      "Dest": { "type": "keyword" },
      "DestAirportID": { "type": "keyword" },
      "DestCityName": { "type": "keyword" },
      "DestCountry": { "type": "keyword" },
      "DestLocation": { "type": "geo_point" },
      "DestRegion": { "type": "keyword" },
      "DestWeather": { "type": "keyword" },
      "DistanceKilometers": { "type": "float" },
      "DistanceMiles": { "type": "float" },
      "FlightDelay": { "type": "boolean" },
      "FlightDelayMin": { "type": "integer" },
      "FlightDelayType": { "type": "keyword" },
      "FlightNum": { "type": "keyword" },
      "FlightTimeHour": { "type": "keyword" },
      "FlightTimeMin": { "type": "float" },
      "Origin": { "type": "keyword" },
      "OriginAirportID": { "type": "keyword" },
      "OriginCityName": { "type": "keyword" },
      "OriginCountry": { "type": "keyword" },
      "OriginLocation": { "type": "geo_point" },
      "OriginRegion": { "type": "keyword" },
      "OriginWeather": { "type": "keyword" },
      "dayOfWeek": { "type": "integer" },
      "timestamp": { "type": "date" }
    }
  }
}
# Reindex: copy the sample data into the 20-shard index
POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

# Count the documents in the source index
GET kibana_sample_data_flights/_count

# Count the documents in the new index; the two counts should match once the reindex has finished
GET my_flights/_count
# Terms aggregation on the original index; show_term_doc_count_error adds a doc_count_error_upper_bound to every bucket
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field": "OriginWeather",
        "size": 5,
        "show_term_doc_count_error": true
      }
    }
  }
}
# Increase shard_size until doc_count_error_upper_bound is 0 to fix the inaccurate aggregation
GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field": "OriginWeather",
        "size": 5,
        "shard_size": 10,
        "show_term_doc_count_error": true
      }
    }
  }
}
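The "increase shard_size until the error is 0" procedure can also be automated. The following Python sketch is one possible way to do it, under stated assumptions: `find_exact_shard_size` and `fake_run_query` are hypothetical names, and `run_query` is any callable you supply that sends the search above with the given shard_size and returns the response as a dict. Only the response fields `aggregations.<name>.doc_count_error_upper_bound` are taken from Elasticsearch. The fake below stands in for a real cluster and its numbers are invented.

```python
from typing import Callable, Optional


def find_exact_shard_size(
    run_query: Callable[[int], dict],
    agg_name: str,
    start: int,
    limit: int,
) -> Optional[int]:
    """Double shard_size until the aggregation reports an error bound of 0.

    Returns the first shard_size that gave an exact result, or None if the
    limit was reached first.
    """
    if start < 1:
        raise ValueError("start must be at least 1")
    shard_size = start
    while shard_size <= limit:
        response = run_query(shard_size)
        agg = response["aggregations"][agg_name]
        if agg["doc_count_error_upper_bound"] == 0:
            return shard_size
        shard_size *= 2
    return None


def fake_run_query(shard_size: int) -> dict:
    """Stand-in for a real search call (assumption: invented numbers).

    Pretends the error bound reaches 0 once shard_size is at least 20.
    """
    error = 0 if shard_size >= 20 else 40 - shard_size
    return {
        "aggregations": {
            "weather": {"doc_count_error_upper_bound": error, "buckets": []}
        }
    }


if __name__ == "__main__":
    print(find_exact_shard_size(fake_run_query, "weather", start=5, limit=1000))
```

Doubling keeps the number of round trips small. Since a larger shard_size makes every shard return more buckets, the `limit` argument is there to stop the search before the requests become too expensive.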
