Using Field Collapsing in Elasticsearch 5.x to Deduplicate Search Results

痛定思痛。 2024-04-17 15:02

1. Introduction

Elasticsearch 5.x has a very interesting feature called field collapsing (Field Collapsing, #22337), and I'd like to share it here.

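Field collapsing is a long-standing request. See issue #256: it was first filed in July 2010 and is one of the most-discussed threads (240+ comments). It took six years for the feature to land, which is pretty impressive, haha.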

The feature is slated to ship in 5.3.

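So, what is field collapsing? Think of it as merging and deduplicating results by a particular field. Say we have a recipe search and I want to collapse on the recipe's "cuisine" field, so that each cuisine contributes a single result; in other words, results are deduplicated by cuisine. When I search for the keyword "鱼" (fish), I want the results to span cuisines: Hunan, Cantonese, Chinese, Western, not Hunan dishes only. That is the idea: collapsing on a field makes the search results more diverse.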
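To make the idea concrete, here is a minimal Python sketch of what collapsing means logically. It is not how Elasticsearch implements it internally, and the dish names and the `cuisine` field are made up for illustration.

```python
def collapse(hits, field):
    """Keep only the first (best-ranked) hit for each distinct value of `field`."""
    seen = set()
    collapsed = []
    for hit in hits:  # hits are assumed to arrive already ranked, best first
        key = hit[field]
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Hypothetical ranked results for a keyword search, best match first.
ranked_hits = [
    {"name": "dish A", "cuisine": "Hunan"},
    {"name": "dish B", "cuisine": "Hunan"},
    {"name": "dish C", "cuisine": "Cantonese"},
    {"name": "dish D", "cuisine": "Hunan"},
    {"name": "dish E", "cuisine": "Western"},
]

for hit in collapse(ranked_hits, "cuisine"):
    print(hit["cuisine"], hit["name"])
```

Each cuisine shows up once, represented by its best-ranked dish, and the relative order of the surviving hits is unchanged.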

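At this point some of you will surely think: just use a terms agg + top hits agg. Combining the two aggregations does achieve the above, but it has some limitations: it cannot paginate (#4915); the results are not precise enough (top terms + top hits: the ES aggregation implementation trades accuracy for speed); and with large data volumes the aggregation is slow, which hurts the search experience.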

So how is the new field collapsing implemented? The key points are:

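  • Collapsing and fetching inner_hits run in two phases (the combined-aggregation approach has only one), so the top hits are always exact.
  • Collapsing happens only at the top hits level. There is no need to compute the actual doc values of every collapse key over the full result set each time; only this small set of top hits is processed, which saves a lot of memory compared with a terms agg.
  • Because collapsing is applied only to the top hits, it is much faster than the combined-aggregation approach.
  • Collapsing top docs does not need global ordinals to convert strings, which also saves a lot of memory compared with aggs.
  • Pagination becomes possible, with the same limitation as regular search: from+size entries are fetched first, then merged.
  • Collapsing affects only the search hits, not aggregations. The total in the search result is the number of all matching documents; the number of deduplicated results is unknown (it cannot be computed).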
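The paging and `total` behaviour in the last two points can be imitated locally. The sketch below is my own simplification, not Elasticsearch code: it uses the nine recipes from the usage section, ranks by `rating` instead of relevance score (scores cannot be reproduced outside ES), breaks ties by input order, and ignores shards. `collapsed_page` is a hypothetical helper name.

```python
# The nine sample recipes from the usage section: name, rating, cuisine ("type").
RECIPES = [
    {"name": n, "rating": r, "type": t}
    for n, r, t in [
        ("清蒸鱼头", 1, "湘菜"), ("剁椒鱼头", 2, "湘菜"), ("红烧鲫鱼", 3, "湘菜"),
        ("鲫鱼汤(辣)", 3, "湘菜"), ("鲫鱼汤(微辣)", 4, "湘菜"), ("鲫鱼汤(变态辣)", 5, "湘菜"),
        ("广式鲫鱼汤", 5, "粤菜"), ("鱼香肉丝", 2, "川菜"), ("奶油鲍鱼汤", 2, "西菜"),
    ]
]


def collapsed_page(docs, field, sort_field, from_, size):
    """Return (total, page): total counts every match, page holds collapsed group heads."""
    # sorted() is stable, so documents with equal ratings keep their input order.
    ranked = sorted(docs, key=lambda d: d[sort_field], reverse=True)
    heads, seen = [], set()
    for doc in ranked:
        if len(heads) >= from_ + size:  # only from+size group heads are ever collected
            break
        if doc[field] not in seen:
            seen.add(doc[field])
            heads.append(doc)
    return len(ranked), heads[from_:from_ + size]


total, page1 = collapsed_page(RECIPES, "type", "rating", from_=0, size=3)
print(total, [d["type"] for d in page1])
total, page2 = collapsed_page(RECIPES, "type", "rating", from_=3, size=3)
print(total, [d["type"] for d in page2])
```

Both pages report the same total of 9 matching documents even though only four distinct cuisines exist, so the total says nothing about how many collapsed results there are.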

2. Usage

A concrete example makes it clear how this works; it is very simple to use.

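1) First prepare the index and data. We use recipes as the example: name is the recipe name, type is the cuisine (湘菜 Hunan, 粤菜 Cantonese, 川菜 Sichuan, 西菜 Western), and rating is the cumulative average user rating.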

    DELETE recipes
    PUT recipes
    POST recipes/type/_mapping
    {
      "properties": {
        "name": { "type": "text" },
        "rating": { "type": "float" },
        "type": { "type": "keyword" }
      }
    }
    POST recipes/type/
    { "name": "清蒸鱼头", "rating": 1, "type": "湘菜" }
    POST recipes/type/
    { "name": "剁椒鱼头", "rating": 2, "type": "湘菜" }
    POST recipes/type/
    { "name": "红烧鲫鱼", "rating": 3, "type": "湘菜" }
    POST recipes/type/
    { "name": "鲫鱼汤(辣)", "rating": 3, "type": "湘菜" }
    POST recipes/type/
    { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" }
    POST recipes/type/
    { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }
    POST recipes/type/
    { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }
    POST recipes/type/
    { "name": "鱼香肉丝", "rating": 2, "type": "川菜" }
    POST recipes/type/
    { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" }
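If typing nine separate POST requests feels tedious, the same documents could be loaded in one request through the `_bulk` endpoint. The Python sketch below only builds the newline-delimited payload; the helper name `bulk_body` is my own, and sending the payload (for example with `POST _bulk` in the console) is left to you.

```python
import json

# The same nine recipes as above: name, rating, cuisine ("type").
RECIPES = [
    {"name": n, "rating": r, "type": t}
    for n, r, t in [
        ("清蒸鱼头", 1, "湘菜"), ("剁椒鱼头", 2, "湘菜"), ("红烧鲫鱼", 3, "湘菜"),
        ("鲫鱼汤(辣)", 3, "湘菜"), ("鲫鱼汤(微辣)", 4, "湘菜"), ("鲫鱼汤(变态辣)", 5, "湘菜"),
        ("广式鲫鱼汤", 5, "粤菜"), ("鱼香肉丝", 2, "川菜"), ("奶油鲍鱼汤", 2, "西菜"),
    ]
]


def bulk_body(docs, index="recipes", doc_type="type"):
    """Build a bulk payload: one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # a bulk body has to end with a newline


print(bulk_body(RECIPES))
```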

2) Now let's see what a plain query returns: search for dishes whose name contains "鱼" and return 3 hits.

    POST recipes/type/_search
    {
      "query": {
        "match": { "name": "鱼" }
      },
      "size": 3
    }

3) All Hunan dishes, good grief. I have been avoiding spicy food lately, so this first page of results is useless to me:

    {
      "took": 2,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": 0.26742277,
        "hits": [
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHYF_OA-dG63Txsd",
            "_score": 0.26742277,
            "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHXO_OA-dG63Txsa",
            "_score": 0.19100356,
            "_source": { "name": "红烧鲫鱼", "rating": 3, "type": "湘菜" }
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHWy_OA-dG63TxsZ",
            "_score": 0.19100356,
            "_source": { "name": "剁椒鱼头", "rating": 2, "type": "湘菜" }
          }
        ]
      }
    }

Let's try again, this time sorting by rating to see which dishes everyone likes and whether there is anything I would enjoy. Run the query:

    POST recipes/type/_search
    {
      "query": {
        "match": { "name": "鱼" }
      },
      "sort": [
        { "rating": { "order": "desc" } }
      ],
      "size": 3
    }

The results are a bit better, but 2 of the 3 are still Hunan dishes, which is still not quite right. The results:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHYF_OA-dG63Txsd",
            "_score": null,
            "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" },
            "sort": [ 5 ]
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHYW_OA-dG63Txse",
            "_score": null,
            "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
            "sort": [ 5 ]
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHX7_OA-dG63Txsc",
            "_score": null,
            "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" },
            "sort": [ 4 ]
          }
        ]
      }
    }

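Now I know what I want: show me the other cuisines. Doesn't this place also serve Western, Cantonese and other cuisines? Give me one dish per cuisine. Switch to a terms agg to get one bucket per unique term, then nest a top_hits agg that returns the first top hit ordered by rating. It sounds a bit complicated, but the query below makes it clear: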

    GET recipes/type/_search
    {
      "query": {
        "match": { "name": "鱼" }
      },
      "sort": [
        { "rating": { "order": "desc" } }
      ],
      "aggs": {
        "type": {
          "terms": {
            "field": "type",
            "size": 10
          },
          "aggs": {
            "rated": {
              "top_hits": {
                "sort": [
                  { "rating": { "order": "desc" } }
                ],
                "size": 1
              }
            }
          }
        }
      },
      "size": 0,
      "from": 0
    }

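Look at the results below. The JSON structure is somewhat complex, but we finally have what we wanted: Hunan (湘菜), Sichuan (川菜), Cantonese (粤菜) and Western (西菜) all show up, one each, with no repeats: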

    {
      "took": 4,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "type": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "湘菜",
              "doc_count": 6,
              "rated": {
                "hits": {
                  "total": 6,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHYF_OA-dG63Txsd",
                      "_score": null,
                      "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" },
                      "sort": [ 5 ]
                    }
                  ]
                }
              }
            },
            {
              "key": "川菜",
              "doc_count": 1,
              "rated": {
                "hits": {
                  "total": 1,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHYr_OA-dG63Txsf",
                      "_score": null,
                      "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" },
                      "sort": [ 2 ]
                    }
                  ]
                }
              }
            },
            {
              "key": "粤菜",
              "doc_count": 1,
              "rated": {
                "hits": {
                  "total": 1,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHYW_OA-dG63Txse",
                      "_score": null,
                      "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
                      "sort": [ 5 ]
                    }
                  ]
                }
              }
            },
            {
              "key": "西菜",
              "doc_count": 1,
              "rated": {
                "hits": {
                  "total": 1,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHY3_OA-dG63Txsg",
                      "_score": null,
                      "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" },
                      "sort": [ 2 ]
                    }
                  ]
                }
              }
            }
          ]
        }
      }
    }
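For intuition, the bucketing that this aggregation pair performs can be mimicked locally in a few lines of Python. This is only a sketch under my own assumptions: it runs over the nine sample recipes in a single process (so none of the cross-shard approximation applies), and it orders buckets by document count and then by key, which happens to reproduce the bucket order in the response above. `terms_top_hits` is a hypothetical helper name.

```python
from collections import defaultdict

# The nine sample recipes: name, rating, cuisine ("type").
RECIPES = [
    {"name": n, "rating": r, "type": t}
    for n, r, t in [
        ("清蒸鱼头", 1, "湘菜"), ("剁椒鱼头", 2, "湘菜"), ("红烧鲫鱼", 3, "湘菜"),
        ("鲫鱼汤(辣)", 3, "湘菜"), ("鲫鱼汤(微辣)", 4, "湘菜"), ("鲫鱼汤(变态辣)", 5, "湘菜"),
        ("广式鲫鱼汤", 5, "粤菜"), ("鱼香肉丝", 2, "川菜"), ("奶油鲍鱼汤", 2, "西菜"),
    ]
]


def terms_top_hits(docs, field, sort_field, bucket_size=10, hits_size=1):
    """Group docs by `field`, then keep the top `hits_size` docs of each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    # Largest buckets first; ties are broken by key (an assumption of this sketch).
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:bucket_size]
    result = []
    for key, members in ordered:
        top = sorted(members, key=lambda d: d[sort_field], reverse=True)[:hits_size]
        result.append({"key": key, "doc_count": len(members), "top": top})
    return result


for bucket in terms_top_hits(RECIPES, "type", "rating"):
    print(bucket["key"], bucket["doc_count"], bucket["top"][0]["name"])
```

Note that every document has to be placed into a bucket before any top hit can be picked, which hints at why this approach gets expensive on large result sets.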

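As noted earlier, the approach above works but has limitations. So how does the new field collapsing handle it? The query is below: just add a collapse parameter naming the field to deduplicate on, which here is of course the cuisine field "type":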

    GET recipes/type/_search
    {
      "query": {
        "match": { "name": "鱼" }
      },
      "collapse": {
        "field": "type"
      },
      "size": 3,
      "from": 0
    }

The results look great, and the hits have the familiar shape (the same as ordinary query results):

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoDNlRJ_OA-dG63TxpW",
            "_score": 0.018980097,
            "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" },
            "fields": { "type": [ "湘菜" ] }
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoDNlRk_OA-dG63TxpZ",
            "_score": 0.013813315,
            "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" },
            "fields": { "type": [ "川菜" ] }
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoDNlRb_OA-dG63TxpY",
            "_score": 0.0125863515,
            "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
            "fields": { "type": [ "粤菜" ] }
          }
        ]
      }
    }

Now let me try paging. The query above returned 3 hits, so I change from to 3 and run it again. The response:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoDNlRw_OA-dG63Txpa",
            "_score": 0.012546891,
            "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" },
            "fields": { "type": [ "西菜" ] }
          }
        ]
      }
    }
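The request behind this second page is not reproduced here; presumably it is the earlier collapse query with from set to 3. Here is a small Python sketch that builds such a body, where `collapse_search_body` is a hypothetical helper of my own and not part of any client library:

```python
import json


def collapse_search_body(keyword, collapse_field, size, from_=0):
    """Build a search body that matches on name and collapses on one field."""
    return {
        "query": {"match": {"name": keyword}},
        "collapse": {"field": collapse_field},
        "size": size,
        "from": from_,
    }


# Second page: skip the first 3 collapsed hits, ask for up to 3 more.
page2 = collapse_search_body("鱼", "type", size=3, from_=3)
print("GET recipes/type/_search")
print(json.dumps(page2, ensure_ascii=False, indent=2))
```

The printed request can be pasted into the console as-is.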

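Only one hit this time: after deduplication there are only 4 entries in total, so this is working as expected. But each cuisine has just one dish, and I'm not happy with that; give me a few more per cuisine so I have something to choose from. Add the inner_hits parameter to control how many hits are returned per group, here 2, sorted by rating as well. The new query: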

    GET recipes/type/_search
    {
      "query": {
        "match": { "name": "鱼" }
      },
      "collapse": {
        "field": "type",
        "inner_hits": {
          "name": "top_rated",
          "size": 2,
          "sort": [
            { "rating": "desc" }
          ]
        }
      },
      "sort": [
        { "rating": { "order": "desc" } }
      ],
      "size": 2,
      "from": 0
    }

The query results are below. Perfect:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHYF_OA-dG63Txsd",
            "_score": null,
            "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" },
            "fields": { "type": [ "湘菜" ] },
            "sort": [ 5 ],
            "inner_hits": {
              "top_rated": {
                "hits": {
                  "total": 6,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHYF_OA-dG63Txsd",
                      "_score": null,
                      "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" },
                      "sort": [ 5 ]
                    },
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHX7_OA-dG63Txsc",
                      "_score": null,
                      "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" },
                      "sort": [ 4 ]
                    }
                  ]
                }
              }
            }
          },
          {
            "_index": "recipes",
            "_type": "type",
            "_id": "AVoESHYW_OA-dG63Txse",
            "_score": null,
            "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
            "fields": { "type": [ "粤菜" ] },
            "sort": [ 5 ],
            "inner_hits": {
              "top_rated": {
                "hits": {
                  "total": 1,
                  "max_score": null,
                  "hits": [
                    {
                      "_index": "recipes",
                      "_type": "type",
                      "_id": "AVoESHYW_OA-dG63Txse",
                      "_score": null,
                      "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
                      "sort": [ 5 ]
                    }
                  ]
                }
              }
            }
          }
        ]
      }
    }
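To see why the response has this shape, the two phases can be imitated locally in Python. This is a rough sketch under my own assumptions, not Elasticsearch's implementation: it works on the nine sample recipes in one process, ranks by `rating`, and breaks ties by input order. `collapse_with_inner_hits` is a hypothetical helper name.

```python
# The nine sample recipes: name, rating, cuisine ("type").
RECIPES = [
    {"name": n, "rating": r, "type": t}
    for n, r, t in [
        ("清蒸鱼头", 1, "湘菜"), ("剁椒鱼头", 2, "湘菜"), ("红烧鲫鱼", 3, "湘菜"),
        ("鲫鱼汤(辣)", 3, "湘菜"), ("鲫鱼汤(微辣)", 4, "湘菜"), ("鲫鱼汤(变态辣)", 5, "湘菜"),
        ("广式鲫鱼汤", 5, "粤菜"), ("鱼香肉丝", 2, "川菜"), ("奶油鲍鱼汤", 2, "西菜"),
    ]
]


def collapse_with_inner_hits(docs, field, sort_field, size, inner_size):
    """Phase 1 collapses the ranked docs; phase 2 expands each surviving group."""
    ranked = sorted(docs, key=lambda d: d[sort_field], reverse=True)  # stable sort
    # Phase 1: take the best-ranked document of each group, up to `size` groups.
    heads, seen = [], set()
    for doc in ranked:
        if len(heads) >= size:
            break
        if doc[field] not in seen:
            seen.add(doc[field])
            heads.append(doc)
    # Phase 2: for each group head, collect that group's own top documents.
    results = []
    for head in heads:
        members = [d for d in ranked if d[field] == head[field]]
        results.append({
            "head": head,
            "total": len(members),
            "inner_hits": members[:inner_size],
        })
    return results


for group in collapse_with_inner_hits(RECIPES, "type", "rating", size=2, inner_size=2):
    names = [d["name"] for d in group["inner_hits"]]
    print(group["head"]["type"], group["total"], names)
```

The second phase only touches the groups that survived the first one, which is the reason the inner hits stay exact while the work stays small.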
