MySQL中MGR中SECONDARY节点磁盘满,导致mysqld进程被OOM Killed 淩亂°似流年 2022-09-12 05:47 104阅读 0赞 本文首发于 GreatSQL社区 微信公众号。 * 问题描述 * MySQL 8.0.26 测试过程 * disk full报告过程及何时被oom killed * 关注mysqld进程内存消耗变化 * GreatSQL 8.0.25测试过程 > 在MGR测试中,人为制造磁盘满问题后,节点被oom killed ## 问题描述 ## 在对MySQL 8.0.26 vs GreatSQL 8.0.25的对比测试过程中,有一个环节是人为制造磁盘满的场景,看看MGR是否还能正常响应请求。 在实测过程中,最后发现磁盘满的那个节点,持续时间足够久后,会因为内存消耗过大而最终被OS给OOM Kill。 这个问题我已报告BUG(\#104979),下面是该过程的详细记录。 首先,直接利用dd复制空文件填满磁盘。 ## MySQL 8.0.26 测试过程 ## ### disk full报告过程及何时被oom killed ### 来看下MySQL 8.0.26遇到disk full时日志都输出哪些内容: # 首次提示disk full的时刻是 09:44:10.052558,这时其实还能写入日志,只是不能写数据和binlog 2021-09-18T09:44:10.052558+08:00 10 [ERROR] [MY-000035] [Server] Disk is full writing './yejr-mgr3-relay-bin-group_replication_applier.000046' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs. 2021-09-18T09:44:10.052558+08:00 15 [ERROR] [MY-000035] [Server] Disk is full writing '/data/MySQL/binlog.000039' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs. 2021-09-18T09:54:10.109075+08:00 15 [ERROR] [MY-000035] [Server] Disk is full writing '/data/MySQL/binlog.000039' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs. 2021-09-18T09:54:10.109828+08:00 10 [ERROR] [MY-000035] [Server] Disk is full writing './yejr-mgr3-relay-bin-group_replication_applier.000046' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs. 2021-09-18T09:56:15.020870+08:00 152 [ERROR] [MY-010907] [Server] Error writing file '/data/MySQL/slow.log' (errno: 28 - No space left on device) # 最后一次提示disk full时是10:04:10.166349,这时候彻底无法写入日志了 2021-09-18T10:04:10.166349+08:00 15 [ERROR] [MY-000035] [Server] Disk is full writing '/data/MySQL/binlog.000039' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. 从disk full时刻开始,大约过了2.5小时,mysqld进程内存消耗持续上升,最终引发oom kill Sep 18 12:56:28 mgr3 kernel: docker-containe invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-500 ... Sep 18 12:56:29 mgr3 kernel: Out of memory: Kill process 9539 (mysqld) score 902 or sacrifice child Sep 18 12:56:29 mgr3 kernel: Killed process 9539 (mysqld), UID 27, total-vm:17020364kB, anon-rss:14644556kB, file-rss:0kB, shmem-rss:0kB 在这期间某个时刻抓到的待认证事务堆积,在被oom kill前实际不止这么多: +--------------------------------------+----------+ | id | trx_que | +--------------------------------------+----------+ | dbe4f563-1622-11ec-8cc8-525400e802e2 | 87863918 | +--------------------------------------+----------+ ### 关注mysqld进程内存消耗变化 ### 下面是mysqld进程内存消耗变化情况 # 一开始3G 9539 3144872 /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64/bin/mysqld --defaults-file=/data/MySQL/my.cnf ... # 涨到3.47G 9539 3641984 /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64/bin/mysqld --defaults-file=/data/MySQL/my.cnf ... # 涨到7G 9539 7416908 /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64/bin/mysqld --defaults-file=/data/MySQL/my.cnf ... # 不断增长,直至最后被oom killed前,大约飙到14G 9539 14638256 /usr/local/mysql-8.0.26-linux-glibc2.12-x86_64/bin/mysqld --defaults-file=/data/MySQL/my.cnf OS层oom-killer相关日志: # mysqld进程内存大约飙到14G Sep 18 12:56:29 mgr3 kernel: Out of memory: Kill process 9539 (mysqld) score 902 or sacrifice child Sep 18 12:56:29 mgr3 kernel: Killed process 9539 (mysqld), UID 27, total-vm:17020364kB, anon-rss:14644556kB, file-rss:0kB, shmem-rss:0kB ## GreatSQL 8.0.25测试过程 ## 作为对比,我用GreatSQL 8.0.25也做了同样的测试。 # 首次报告disk full 2021-09-18T23:07:49.264992+08:00 89 [ERROR] [MY-000035] [Server] Disk is full writing '/data/GreatSQL/binlog.000085' (OS errno 28 - N o space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs. 2021-09-18T23:07:49.264992+08:00 82 [ERROR] [MY-000035] [Server] Disk is full writing './yejr-mgr3-relay-bin-group_replication_applie r.000147' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 se cs. # 发出报错,提示因为disk full,无法写入data/binlog 2021-09-18T23:07:49.838301+08:00 84 [ERROR] [MY-011071] [Repl] Plugin group_replication reported: 'io full on data or binlog director y' # 本节点自动改为RO 2021-09-18T23:07:49.852475+08:00 84 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set i nto read only mode after an error was detected.' 2021-09-18T23:07:49.852633+08:00 0 [ERROR] [MY-011486] [Repl] Plugin group_replication reported: 'Message received while the plugin i s not ready, message discarded.' # 接下来准备退出集群了 2021-09-18T23:07:49.856907+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Re-using server node 0 host 17 2.16.16.53' 2021-09-18T23:07:49.856937+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Re-using server node 1 host 17 2.16.16.16' 2021-09-18T23:07:49.856950+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] pid 5211 Installed site start={dba4e34 1394386 1} boot_key={dba4e34 1394375 1} event_horizon=10 node 4294967295 chksum_node_list(&site->nodes) 704906340' 2021-09-18T23:07:55.380111+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Installing leave view.' 2021-09-18T23:07:55.380169+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] ::install_view():: No exchanged data' 2021-09-18T23:07:55.380188+08:00 0 [Note] [MY-011071] [Repl] Plugin group_replication reported: 'on_view_changed is called' # 发生view change,正式退出集群 2021-09-18T23:07:55.380242+08:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.' 2021-09-18T23:07:55.382254+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Rejecting this message. The gr oup communication engine has already stopped.' ...中间几条相同的日志 2021-09-18T23:07:55.382325+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Rejecting this message. The group communication engine has already stopped.' 2021-09-18T23:07:55.382333+08:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Rejecting this message. The group....日志没写完,磁盘彻底填满了 # 下面是第二天我清理磁盘空间后的新日志 2021-09-19T07:10:52.071942+08:00 82 [ERROR] [MY-013309] [Repl] Plugin group_replication reported: 'Transaction '1:38481943' does not exist on Group Replication consistency manager while receiving remote transaction prepare.' 2021-09-19T07:10:52.071990+08:00 82 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.' 2021-09-19T07:10:52.072032+08:00 82 [Warning] [MY-011646] [Repl] Plugin group_replication reported: 'Skipping leave operation: member already left the group.' 2021-09-19T07:10:52.072049+08:00 82 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.' 从日志详情中可以看到,当磁盘空间满了之后,GreatSQL会将那个节点主动退出集群,对整个集群的影响非常小。 此外,从集群退出后,也不会再接收认证事务了,所以也没发生内存持续暴涨最终被oom killed的情况,实际观察过程中发现内存反倒还下降了 # 还在集群中的内存消耗 5211 2790736 /usr/local/GreatSQL-8.0.25-15-Linux-glibc2.17-x86_64-minimal/bin/mysqld ... # 退出集群后的内存反倒降低了 5211 2801696 /usr/local/GreatSQL-8.0.25-15-Linux-glibc2.17-x86_64-minimal/bin/mysqld 5211 968876 /usr/local/GreatSQL-8.0.25-15-Linux-glibc2.17-x86_64-minimal/bin/mysqld ... # 此后内存一直保持这个值 5211 969172 /usr/local/GreatSQL-8.0.25-15-Linux-glibc2.17-x86_64-minimal/bin/mysqld 这样对比来看,GreatSQL的可靠性还真是可以的,官方的MySQL MGR的可靠性还有待进一步加强呀。 P.S,本文即将推送前,收到MySQL官方bug团队的回复,认为这不是一个bug,而应该优先解决磁盘满的问题。我补充回复说加个事务缓存上限阈值或许更合理,人继续傲娇的表示我应该先关注磁盘问题。。。 Enjoy MySQL & GreatSQL :) -------------------- **文章推荐:** * [面向金融级应用的GreatSQL正式开源][GreatSQL] * [Changes in GreatSQL 8.0.25 (2021-8-26)][Changes in GreatSQL 8.0.25 _2021-8-26] * [一周碎碎念,2021.9.12,之前说的MGR课程上线啦][2021.9.12_MGR] * [技术分享 | Update更新慢、死锁等问题的排查思路分享][_ Update] * [MySQL分区表的一个性能BUG][MySQL_BUG] * [《叶问》36期,MySQL最多只能用到128个逻辑CPU,是真的吗][36_MySQL_128_CPU] * [GreatSQL重磅特性,InnoDB并行并行查询优化测试][GreatSQL_InnoDB] * [在Linux下源码编译安装GreatSQL/MySQL][Linux_GreatSQL_MySQL] -------------------- 扫码加入GreatSQL/MGR交流**QQ群** ![1cf8d8e76b514b521da2361f15869796.png][] 点击文末“**阅读原文**”直达老叶专栏 [GreatSQL]: http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653936538&idx=1&sn=3017d9d51ea875e8d8c7c39fa6a74b17&chksm=bd3b41f08a4cc8e6554acaf7beef741d446cca01321c3eb5b0e494c8603351c9b8ac993547c3&scene=21#wechat_redirect [Changes in GreatSQL 8.0.25 _2021-8-26]: http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653936538&idx=2&sn=fb72b35d16e78fcb2e83a4ee72b38ccb&chksm=bd3b41f08a4cc8e65e832accd50aa8c1d2df9d54d0a770fd8316099f3aa6ee58372ea99a1b11&scene=21#wechat_redirect [2021.9.12_MGR]: http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653936554&idx=1&sn=cc64211e88d16d8c2cc08b3b19061251&chksm=bd3b41c08a4cc8d6c23c8ba7aa423aaa7b40739c1ffbeaa0e5623fb85c255f8b59305313f3cb&scene=21#wechat_redirect [_ Update]: https://blog.csdn.net/n88Lpo/article/details/120340200 [MySQL_BUG]: https://blog.csdn.net/yongqi_wang/article/details/86576006 [36_MySQL_128_CPU]: http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653936499&idx=1&sn=f2efa2f3ae920b74039b6a9afa820238&chksm=bd3b41198a4cc80f18e90a0ad1b5550efbf20791e9512cd6c43dec9a0689da0d54fab5754749&scene=21#wechat_redirect [GreatSQL_InnoDB]: https://blog.csdn.net/n88Lpo/article/details/120386624 [Linux_GreatSQL_MySQL]: http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653936221&idx=1&sn=27183d95857dabef6bdf0d7541c4da57&chksm=bd3b40378a4cc921828ecf10774787bbf268d693490acf8c151a2b5c830bac3658870e95b396&scene=21#wechat_redirect [1cf8d8e76b514b521da2361f15869796.png]: /images/20220828/f9508a0f69ad4e45aa3de012beb3fca7.png
还没有评论,来说两句吧...