K8S (14) Monitoring in Practice: Grafana Dashboards and Alertmanager Alerts



Table of Contents

  • K8s monitoring in practice: Grafana dashboards and Alertmanager alerts

    • 1 Building dashboards with Grafana

      • 1.1 Deploying Grafana

        • 1.1.1 Prepare the image
        • 1.1.2 Prepare the RBAC manifest
        • 1.1.3 Prepare the Deployment manifest
        • 1.1.4 Prepare the Service manifest
        • 1.1.5 Prepare the Ingress manifest
        • 1.1.6 DNS resolution
        • 1.1.7 Apply the manifests
      • 1.2 Creating charts with Grafana

        • 1.2.1 Verify access from a browser
        • 1.2.2 Install plugins inside the container
        • 1.2.3 Configure the data source
        • 1.2.4 Add the K8S cluster
        • 1.2.5 View k8s cluster data and dashboards
    • 2 Configuring Alertmanager alerting

      • 2.1 Deploying Alertmanager

        • 2.1.1 Prepare the Docker image
        • 2.1.2 Prepare the ConfigMap manifest
        • 2.1.3 Prepare the Deployment manifest
        • 2.1.4 Prepare the Service manifest
        • 2.1.5 Apply the manifests
      • 2.2 Using alerts in K8S

        • 2.2.1 Create a basic alerting-rules file
        • 2.2.2 Update the Prometheus configuration
        • 2.2.3 Test the alerts

1 Building dashboards with Grafana

Prometheus's built-in dashboard claims to offer a wide variety of charts, but it is really quite bare-bones, so the dedicated Grafana tool is normally used for charting.
Grafana official Docker Hub page
Grafana official GitHub repository
Grafana official website

1.1 Deploying Grafana

1.1.1 Prepare the image

docker pull grafana/grafana:5.4.2
docker tag 6f18ddf9e552 harbor.zq.com/infra/grafana:v5.4.2
docker push harbor.zq.com/infra/grafana:v5.4.2

Prepare the directory:

mkdir /data/k8s-yaml/grafana
cd /data/k8s-yaml/grafana

1.1.2 Prepare the RBAC manifest

cat >rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node
EOF
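Once these RBAC objects are applied (section 1.1.7), a quick way to confirm that the binding grants what the kubernetes plugin needs is impersonation. This is only a sketch and assumes your admin kubeconfig is allowed to impersonate users:

kubectl auth can-i list pods --as k8s-node      # expected: yes
kubectl auth can-i delete pods --as k8s-node    # expected: no, only get/list/watch are granted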

1.1.3 Prepare the Deployment manifest

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: grafana
    name: grafana
  name: grafana
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.zq.com/infra/grafana:v5.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: data
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs:
          server: hdss7-200
          path: /data/nfs-volume/grafana
        name: data
EOF

Create the grafana data directory (on the NFS server):

mkdir /data/nfs-volume/grafana
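The Deployment above mounts this directory over NFS from hdss7-200. If /data/nfs-volume is not already exported in your environment, a minimal export sketch would look like this (the allowed network below is an assumption, adjust it to yours):

# run on hdss7-200
echo '/data/nfs-volume 10.4.7.0/24(rw,no_root_squash)' >>/etc/exports
exportfs -r
showmount -e localhost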

1.1.4 Prepare the Service manifest

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: infra
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
EOF

1.1.5 Prepare the Ingress manifest

cat >ingress.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.zq.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
EOF

1.1.6 DNS resolution

vi /var/named/zq.com.zone
grafana            A    10.4.7.10

systemctl restart named
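A quick check that the new record resolves. The DNS server address below is an assumption for this environment; point dig at whichever host runs named:

dig -t A grafana.zq.com @10.4.7.11 +short
# expected output: 10.4.7.10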

1.1.7 Apply the manifests

kubectl apply -f http://k8s-yaml.zq.com/grafana/rbac.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/svc.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/ingress.yaml
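A few commands to confirm the rollout before moving on (object names follow the manifests above):

kubectl -n infra get deployment grafana
kubectl -n infra get pods -l app=grafana -o wide
kubectl -n infra get svc,ingress grafana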

1.2 Creating charts with Grafana

1.2.1 Verify access from a browser

Visit http://grafana.zq.com; the default username/password is admin/admin.
If the page loads, the installation succeeded.
After logging in, change the admin password to admin123 right away.

1.2.2 Install plugins inside the container

Once Grafana is confirmed to be running, enter the Grafana container and install the following plugins:

kubectl -n infra exec -it grafana-d6588db94-xr4s6 /bin/bash
# The following commands are run inside the container
grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel
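The same installs can also be scripted without an interactive shell, which is handy when the pod gets recreated. A sketch; the label selector follows the Deployment above, and the pod name is looked up dynamically:

POD=$(kubectl -n infra get pods -l app=grafana -o jsonpath='{.items[0].metadata.name}')
for p in grafana-kubernetes-app grafana-clock-panel grafana-piechart-panel \
         briangann-gauge-panel natel-discrete-panel; do
  kubectl -n infra exec "$POD" -- grafana-cli plugins install "$p"
done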

1.2.3 Configure the data source

Add the data source: click the gear icon in the left sidebar -> add data source -> Prometheus.
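If you prefer to script this step, the same data source can be created through Grafana's HTTP API. This is a sketch, assuming the admin password set earlier and that Prometheus is reachable at http://prometheus.zq.com in this environment:

curl -s -u admin:admin123 -H 'Content-Type: application/json' \
  -X POST http://grafana.zq.com/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus.zq.com","access":"proxy","isDefault":true}'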

After adding it, restart Grafana:

kubectl -n infra delete pod grafana-7dd95b4c8d-nj5cx

1.2.4 Add the K8S cluster

Enable the K8S plugin: click the gear icon in the left sidebar -> Plugins -> kubernetes -> Enable.
Create a new cluster: click the K8S icon in the left sidebar -> New Cluster.

1.2.5 View k8s cluster data and dashboards

After adding the cluster, wait a few minutes. Until data has been collected you may see an "http forbidden" error; it clears up on its own, usually within 2-5 minutes.

Click Cluster Dashboard.

2 Configuring Alertmanager alerting

2.1 Deploying Alertmanager

2.1.1 Prepare the Docker image

docker pull docker.io/prom/alertmanager:v0.14.0
docker tag 23744b2d645c harbor.zq.com/infra/alertmanager:v0.14.0
docker push harbor.zq.com/infra/alertmanager:v0.14.0

Prepare the directory:

mkdir /data/k8s-yaml/alertmanager
cd /data/k8s-yaml/alertmanager

2.1.2 Prepare the ConfigMap manifest

cat >cm.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
      # How long to wait before declaring an alert resolved when it is no longer firing
      resolve_timeout: 5m
      # SMTP settings for sending mail
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxx@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_require_tls: false
    templates:
      - '/etc/alertmanager/*.tmpl'
    # Root route for all incoming alerts; it defines the dispatch policy
    route:
      # Labels used to regroup incoming alerts: for example, alerts that share
      # cluster=A and alertname=LatencyHigh are aggregated into one group
      group_by: ['alertname', 'cluster']
      # After a new alert group is created, wait at least group_wait before sending the
      # first notification, so several alerts of the same group can be sent together
      group_wait: 30s
      # After the first notification, wait group_interval before notifying about
      # new alerts added to the group
      group_interval: 5m
      # If an alert has already been sent successfully, wait repeat_interval before resending it
      repeat_interval: 5m
      # Default receiver: used when an alert matches no other route
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: 'xxxx@qq.com'
        send_resolved: true
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "{{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }
  email.tmpl: |
    {{ define "email.to.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range .Alerts }}
    Alert program: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Instance: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range .Alerts }}
    Alert program: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Instance: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    {{- end }}
EOF
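Before applying the ConfigMap, the embedded config.yml can be validated with amtool, which ships in the alertmanager image. A sketch, assuming you first copy the config.yml portion out to /tmp/config.yml:

docker run --rm --entrypoint /bin/amtool \
  -v /tmp/config.yml:/tmp/config.yml \
  harbor.zq.com/infra/alertmanager:v0.14.0 check-config /tmp/config.yml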

2.1.3 Prepare the Deployment manifest

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.zq.com/infra/alertmanager:v0.14.0
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config
      imagePullSecrets:
      - name: harbor
EOF

2.1.4 Prepare the Service manifest

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector:
    app: alertmanager
  ports:
  - port: 80
    targetPort: 9093
EOF

2.1.5 Apply the manifests

kubectl apply -f http://k8s-yaml.zq.com/alertmanager/cm.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/svc.yaml
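To confirm Alertmanager came up cleanly (names follow the manifests above):

kubectl -n infra get pods -l app=alertmanager
kubectl -n infra logs deployment/alertmanager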

2.2 Using alerts in K8S

2.2.1 Create a basic alerting-rules file

cat >/data/nfs-volume/prometheus/etc/rules.yml <<'EOF'
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  - alert: OutOfInodes
    expr: node_filesystem_free{fstype="overlay",mountpoint="/"} / node_filesystem_size{fstype="overlay",mountpoint="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of inodes (instance {{ $labels.instance }})"
      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  - alert: OutOfDiskSpace
    expr: node_filesystem_free{fstype="overlay",mountpoint="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint="/rootfs"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadLatency
    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  - alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Probe failed (instance {{ $labels.instance }})"
      description: "Probe failed (current value: {{ $value }})"
  - alert: StatusCode
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Status Code (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  - alert: SslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  - alert: SslCertificateHasExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "SSL certificate has expired (instance {{ $labels.instance }})"
      description: "SSL certificate has expired already (current value: {{ $value }})"
  - alert: BlackboxSlowPing
    expr: probe_icmp_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow ping (instance {{ $labels.instance }})"
      description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  - alert: BlackboxSlowRequests
    expr: probe_http_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow requests (instance {{ $labels.instance }})"
      description: "Blackbox request took more than 2s (current value: {{ $value }})"
  - alert: PodCpuUsagePercent
    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
EOF
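Optionally, validate the rule file with promtool before loading it. This is a sketch: the image name and tag below are assumptions, use whichever Prometheus image you deployed earlier:

# image name/tag is an assumption -- substitute your own Prometheus image
docker run --rm --entrypoint /bin/promtool \
  -v /data/nfs-volume/prometheus/etc/rules.yml:/tmp/rules.yml \
  harbor.zq.com/infra/prometheus:v2.14.0 check rules /tmp/rules.yml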

2.2.2 Update the Prometheus configuration

Append the following to the Prometheus configuration file:

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager"]
rule_files:
- "/data/etc/rules.yml"
EOF

Reload the configuration:

curl -X POST http://prometheus.zq.com/-/reload
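To confirm the rules were picked up, query the standard Prometheus rules API:

curl -s http://prometheus.zq.com/api/v1/rules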


The rules above are now loaded as our alerting rules.

2.2.3 Test the alerts

Stop the dubbo-demo-service in the test namespace.
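One way to stop it, assuming dubbo-demo-service is a Deployment in the test namespace (adjust if it is managed differently in your setup):

kubectl -n test scale deployment dubbo-demo-service --replicas=0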

The blackbox exporter now reports the failure, and the corresponding alert turns yellow (pending) on the Prometheus Alerts page.
Once the alert turns red (firing), email notifications start going out.

If you want to customize the alerting rules or the alert content, spend some time on PromQL and edit the configuration files yourself; a small sketch follows below.
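As a minimal sketch, one extra rule group could be appended to rules.yml following the layout above; the metric and threshold here are only illustrative:

cat >>/data/nfs-volume/prometheus/etc/rules.yml <<'EOF'
- name: customAlert
  rules:
  - alert: hostHighLoadAlert
    expr: node_load5 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} 5m load above 10 (current value: {{ $value }})"
EOF
curl -X POST http://prometheus.zq.com/-/reload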
