Linux Server Troubleshooting

谁借莪1个温暖的怀抱¢ 2023-01-18 09:14

Linux server problems can be investigated from several angles: CPU, memory, disk, and network.

*********************

vmstat: overall performance monitoring

vmstat reports process, memory, swap, I/O, and CPU activity. Usage:

    [root@centos ~]# vmstat -h
    Usage:
     vmstat [options] [delay [count]]
    # vmstat: print a single report
    # vmstat delay: print a report every delay seconds
    # vmstat delay count: print count reports, delay seconds apart
    Options:
     -a, --active           active/inactive memory
     -f, --forks            number of forks since boot
     -m, --slabs            slabinfo
     -n, --one-header       do not redisplay header
     -s, --stats            event counter statistics
     -d, --disk             disk statistics
     -D, --disk-sum         summarize disk statistics
     -p, --partition <dev>  partition specific statistics
     -S, --unit <char>      define display unit
     -w, --wide             wide output
     -t, --timestamp        show timestamp
     -h, --help             display this help and exit
     -V, --version          output version information and exit

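Output fields

    [root@centos ~]# vmstat
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     2  0      0 1551100   2108 168820    0    0   112    10   86  275  1  3 95  0  0
    # procs
    r: processes running or waiting for a CPU  // if this stays above the number of CPUs, the CPUs are saturated
    b: processes blocked in uninterruptible sleep, usually waiting for I/O
    # memory
    free: idle memory
    buff: memory used as buffers (block-device I/O)
    cache: memory used as page cache (file contents)
    # swap
    si: memory swapped in from disk per second
    so: memory swapped out to disk per second  // si and so should be 0; if they stay non-zero, the system is short of memory
    # io
    bi: blocks read from block devices per second
    bo: blocks written to block devices per second  // rule of thumb: when bi+bo exceeds 1000, wa tends to be high, which points to an I/O bottleneck
    # system
    in (interrupts): interrupts per second
    cs (context switches): context switches per second  // the higher in and cs are, the more CPU time goes to sy
    # cpu
    us: CPU time spent in user code (%)  // if it stays above 50%, optimize the processes or deploy additional servers
    sy: CPU time spent in kernel code (%)  // if us+sy stays above 80%, the CPU is a bottleneck
    id: idle CPU time (%)
    wa: CPU time spent idle while waiting for I/O (%)  // above 20% indicates serious I/O wait
    st: CPU time stolen from this virtual machine by the hypervisor (%), normally 0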
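
The thresholds above can be checked mechanically. The following is a minimal sketch, not part of the original article: `check_vmstat` is a hypothetical helper name, and it assumes the default vmstat column order shown above (r, b, swpd, free, buff, cache, si, so, bi, bo, in, cs, us, sy, id, wa, st).

```shell
#!/bin/sh
# check_vmstat NCPU (hypothetical helper): read `vmstat <delay> <count>`
# output on stdin and print a warning for every sample that crosses one of
# the rule-of-thumb limits listed above.
check_vmstat() {
    awk -v ncpu="$1" '
        # Data rows start with a number; the two header rows do not.
        $1 ~ /^[0-9]+$/ && NF >= 16 {
            n++
            if ($1 > ncpu)        printf "sample %d: run queue r=%d exceeds %d CPU(s)\n", n, $1, ncpu
            if ($7 > 0 || $8 > 0) printf "sample %d: swapping si=%d so=%d\n", n, $7, $8
            if ($13 + $14 > 80)   printf "sample %d: us+sy=%d%%\n", n, $13 + $14
            if ($16 > 20)         printf "sample %d: wa=%d%%\n", n, $16
        }'
}
```

A possible invocation is `vmstat 1 5 | check_vmstat "$(nproc)"`. The first vmstat sample reports averages since boot, so a single warning on sample 1 says little about the current state; the limits only matter when they are exceeded over a longer period.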

*********************

free: memory usage

free shows how much memory is used and how much is still available. Usage:

    [root@centos ~]# free --help
    Usage:
     free [options]
    Options:
     -b, --bytes         show output in bytes                          # display memory in bytes
     -k, --kilo          show output in kilobytes                      # display memory in KB
     -m, --mega          show output in megabytes                      # display memory in MB
     -g, --giga          show output in gigabytes                      # display memory in GB
         --tera          show output in terabytes
         --peta          show output in petabytes
     -h, --human         show human-readable output                    # pick a readable unit automatically
         --si            use powers of 1000 not 1024
     -l, --lohi          show detailed low and high memory statistics  # show low and high memory
     -t, --total         show total for RAM + swap                     # show the total: Mem + Swap
     -s N, --seconds N   repeat printing every N seconds               # print every N seconds
     -c N, --count N     repeat printing N times, then exit            # print N times, then exit
     -w, --wide          wide output
         --help          display this help and exit
     -V, --version       output version information and exit

Examples

    [root@centos ~]# free -h
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
    Swap:          6.0G          0B        6.0G
    # Show the total of Mem + Swap
    [root@centos ~]# free -t
                  total        used        free      shared  buff/cache   available
    Mem:        1863224      141796     1549804        9740      171624     1538100
    Swap:       6291452           0     6291452
    Total:      8154676      141796     7841256
    [root@centos ~]# free -th
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
    Swap:          6.0G          0B        6.0G
    Total:         7.8G        138M        7.5G
    # Show low and high memory
    [root@centos ~]# free -lh
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
    Low:           1.8G        305M        1.5G
    High:            0B          0B          0B
    Swap:          6.0G          0B        6.0G
    [root@centos ~]# free -lht
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
    Low:           1.8G        305M        1.5G
    High:            0B          0B          0B
    Swap:          6.0G          0B        6.0G
    Total:         7.8G        138M        7.5G
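
For alerting, the `available` column is usually more telling than `free`, because it includes cache that can be reclaimed. The sketch below is an illustration only: `mem_available_pct` is a hypothetical helper, and it assumes the seven-column `Mem:` layout shown above (older versions of free print a different layout without an `available` column).

```shell
#!/bin/sh
# mem_available_pct (hypothetical helper): read plain `free` output
# (numeric units, not -h) on stdin and print available memory as an
# integer percentage of total memory.
# Assumes the layout above: $2 = total, $7 = available on the "Mem:" row.
mem_available_pct() {
    awk '$1 == "Mem:" { printf "%d\n", $7 * 100 / $2 }'
}
```

It could be used as `free | mem_available_pct`; the percentage is truncated, not rounded.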

*********************

top, uptime: CPU status

uptime: shows the 1-, 5-, and 15-minute load averages (the average number of processes that are running, waiting for a CPU, or in uninterruptible sleep)

    [root@centos ~]# uptime -h
    Usage:
     uptime [options]
    Options:
     -p, --pretty   show uptime in pretty format
     -h, --help     display this help and exit
     -s, --since    system up since
     -V, --version  output version information and exit
    # Output fields
    [root@centos ~]# uptime
     18:21:32 up  1:13,  1 user,  load average: 0.00, 0.01, 0.05
    18:21:32: current system time
    1:13: the system has been up for 1 hour 13 minutes
    load average: 0.00, 0.01, 0.05: the 1-, 5-, and 15-minute load averages are 0.00, 0.01, and 0.05

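Divide the load average by the number of CPUs. A result above 1 means there is more demand than the CPUs can serve at once; as a rule of thumb, a result of 3 or less is still acceptable, while a value that stays above 5 means the CPU load is too high.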
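
The division can be scripted. This is a small sketch under stated assumptions: `load_per_cpu` is a hypothetical helper that takes the load average and the CPU count as arguments, so it can be fed from any source.

```shell
#!/bin/sh
# load_per_cpu LOAD NCPU (hypothetical helper): print the load average
# divided by the number of CPUs, with two decimals.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# On a live Linux host the inputs could come from /proc/loadavg (first field
# is the 1-minute load average) and nproc (GNU coreutils):
#   load_per_cpu "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```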

top: shows system load, process information, CPU state, and memory state in real time

    [root@centos ~]# top
    # System load, same as uptime
    top - 18:33:49 up  1:25,  1 user,  load average: 0.00, 0.01, 0.05
    # Processes: 100 in total, 2 running, 98 sleeping, 0 stopped, 0 zombie
    Tasks: 100 total,   2 running,  98 sleeping,   0 stopped,   0 zombie
    # CPU state
    %Cpu(s):  0.0 us, 10.5 sy,  0.0 ni, 89.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    us: user processes use 0.0%
    sy: kernel code uses 10.5%
    ni: user processes with a changed priority (niced) use 0.0%
    id: the CPU is idle 89.5% of the time
    wa: the CPU waits for I/O 0.0% of the time
    # Memory and swap
    KiB Mem :  1863224 total,  1549640 free,   141960 used,   171624 buff/cache
    KiB Swap:  6291452 total,  6291452 free,        0 used.  1537936 avail Mem
    # Per-process state
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     7626 root      20   0  161884   2160   1532 R   5.0   0.1   0:00.03 top
        1 root      20   0  128008   6564   4148 S   0.0   0.4   0:03.53 systemd
        2 root      20   0       0      0      0 S   0.0   0.0   0:00.01 kthreadd
        3 root      20   0       0      0      0 S   0.0   0.0   0:00.32 ksoftirqd/0
        5 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 kworker/0:0H
        7 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/0
        8 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_bh
        9 root      20   0       0      0      0 R   0.0   0.0   0:01.66 rcu_sched
       10 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 lru-add-drain
       11 root      rt   0       0      0      0 S   0.0   0.0   0:00.13 watchdog/0
       13 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kdevtmpfs
       14 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 netns
       15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 khungtaskd
       16 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 writeback
       17 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 kintegrityd
       18 root       0 -20       0      0      0 S   0.0   0.0   0:00.00 bioset
*********************

df, iostat: disk monitoring

df: shows the free space of each file system. Usage:

    [root@centos ~]# df --help
    Usage: df [OPTION]... [FILE]...
    Show information about the file system on which each FILE resides,
    or all file systems by default.
    Mandatory arguments to long options are mandatory for short options too.
      -a, --all             include pseudo, duplicate, inaccessible file systems
      -B, --block-size=SIZE  scale sizes by SIZE before printing them; e.g.,
                               '-BM' prints sizes in units of 1,048,576 bytes;
                               see SIZE format below
          --direct          show statistics for a file instead of mount point
          --total           produce a grand total
      -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
      -H, --si              likewise, but use powers of 1000 not 1024
      -i, --inodes          list inode information instead of block usage
      -k                    like --block-size=1K
      -l, --local           limit listing to local file systems
          --no-sync         do not invoke sync before getting usage info (default)
          --output[=FIELD_LIST]  use the output format defined by FIELD_LIST,
                                   or print all fields if FIELD_LIST is omitted.
      -P, --portability     use the POSIX output format
          --sync            invoke sync before getting usage info
      -t, --type=TYPE       limit listing to file systems of type TYPE
      -T, --print-type      print file system type
      -x, --exclude-type=TYPE   limit listing to file systems not of type TYPE
      -v                    (ignored)
          --help     display this help and exit
          --version  output version information and exit

Example

    [root@centos ~]# df -h
    Filesystem               Size  Used Avail Use% Mounted on
    /dev/mapper/centos-root   72G   20G   53G  27% /
    devtmpfs                 898M     0  898M   0% /dev
    tmpfs                    910M     0  910M   0% /dev/shm
    tmpfs                    910M  9.5M  901M   2% /run
    tmpfs                    910M     0  910M   0% /sys/fs/cgroup
    /dev/sda1                2.0G  146M  1.9G   8% /boot
    tmpfs                    182M     0  182M   0% /run/user/0
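
To spot nearly full file systems without reading the whole table, the usage column can be filtered. A minimal sketch, assuming the POSIX output format of `df -P` (one line per file system, usage in column 5, mount point in column 6); `disk_over_threshold` is a hypothetical helper, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# disk_over_threshold PCT (hypothetical helper): read `df -P` output on
# stdin and print the mount point and usage of every file system that is
# at least PCT percent full.
disk_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # "27%" -> "27"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

A possible invocation is `df -P | disk_over_threshold 80`.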

If disk space is running low, find the largest files; those that are no longer needed can be truncated or deleted.

du: finds the largest files or directories. Usage:

    [root@centos ~]# du --help
    Usage: du [OPTION]... [FILE]...
      or:  du [OPTION]... --files0-from=F
    Summarize disk usage of each FILE, recursively for directories.
    Mandatory arguments to long options are mandatory for short options too.
      -0, --null            end each output line with 0 byte rather than newline
      -a, --all             write counts for all files, not just directories
          --apparent-size   print apparent sizes, rather than disk usage; although
                              the apparent size is usually smaller, it may be
                              larger due to holes in ('sparse') files, internal
                              fragmentation, indirect blocks, and the like
      -B, --block-size=SIZE  scale sizes by SIZE before printing them; e.g.,
                               '-BM' prints sizes in units of 1,048,576 bytes;
                               see SIZE format below
      -b, --bytes           equivalent to '--apparent-size --block-size=1'
      -c, --total           produce a grand total
      -D, --dereference-args  dereference only symlinks that are listed on the
                              command line
      -d, --max-depth=N     print the total for a directory (or file, with --all)
                              only if it is N or fewer levels below the command
                              line argument;  --max-depth=0 is the same as
                              --summarize
          --files0-from=F   summarize disk usage of the
                              NUL-terminated file names specified in file F;
                              if F is -, then read names from standard input
      -H                    equivalent to --dereference-args (-D)
      -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
                            # readable output: each size is shown in a unit that fits its magnitude
          --inodes          list inode usage information instead of block usage
      -k                    like --block-size=1K    # show sizes in KB
      -L, --dereference     dereference all symbolic links
      -l, --count-links     count sizes many times if hard linked
      -m                    like --block-size=1M    # show sizes in MB
      -P, --no-dereference  don't follow any symbolic links (this is the default)
      -S, --separate-dirs   for directories do not include size of subdirectories
          --si              like -h, but use powers of 1000 not 1024
      -s, --summarize       display only a total for each argument
      -t, --threshold=SIZE  exclude entries smaller than SIZE if positive,
                              or entries greater than SIZE if negative
          --time            show time of the last modification of any file in the
                              directory, or any of its subdirectories
          --time=WORD       show time as WORD instead of modification time:
                              atime, access, use, ctime or status
          --time-style=STYLE  show times using STYLE, which can be:
                                full-iso, long-iso, iso, or +FORMAT;
                                FORMAT is interpreted like in 'date'
      -X, --exclude-from=FILE  exclude files that match any pattern in FILE
          --exclude=PATTERN    exclude files that match PATTERN
      -x, --one-file-system    skip directories on different file systems
          --help     display this help and exit
          --version  output version information and exit

Examples

    # Find the largest directories under a given directory: du -k --max-depth=1 / | sort -nr | head -5
    [root@centos ~]# du -k --max-depth=1 | sort -nr -k 1 | head -5
    96320   .
    96192   ./.m2
    48      ./.docker
    12      ./.ssh
    0       ./.pki
    # Find files larger than 1 MiB under the current directory: find . -type f -size +1M
    [root@centos ~]# find . -type f -size +1M
    ./.m2/repository/org/apache/lucene/lucene-core/3.6.0/lucene-core-3.6.0.jar
    ./.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
    ./.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-iam/1.11.128/aws-java-sdk-iam-1.11.128.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-ec2/1.11.128/aws-java-sdk-ec2-1.11.128.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-dynamodb/1.11.128/aws-java-sdk-dynamodb-1.11.128.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-ssm/1.11.128/aws-java-sdk-ssm-1.11.128.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-api-gateway/1.11.128/aws-java-sdk-api-gateway-1.11.128.jar
    ./.m2/repository/com/amazonaws/aws-java-sdk-models/1.11.128/aws-java-sdk-models-1.11.128.jar
    ./.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.4/jackson-databind-2.7.4.jar
    # Truncate a file (its contents are no longer needed, but a process is still writing to it): : > file
    # Deleting such a file would not free the space until the writing process closes it.
    # Delete a file (its contents are no longer needed and nothing is writing to it): rm -f file
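
The du and find steps above can be combined into a single ranked list of the largest files. The sketch below is illustrative: `largest_files` is a hypothetical helper, and it relies on the `-printf` extension of GNU find (available on CentOS), which reports the apparent file size in bytes.

```shell
#!/bin/sh
# largest_files DIR N (hypothetical helper): print the N largest regular
# files under DIR as "<size in bytes><TAB><path>", largest first.
# -xdev keeps find on the file system DIR lives on;
# -printf is a GNU find extension.
largest_files() {
    find "$1" -xdev -type f -printf '%s\t%p\n' | sort -nr | head -n "$2"
}
```

A possible invocation is `largest_files /var/log 10`. Because the size is the apparent size, sparse files can rank higher here than they would in du output.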

iostat: shows disk activity. Usage:

    [root@centos ~]# iostat --help
    Usage: iostat [ options ] [ <interval> [ <count> ] ]
    Options are:
    [ -c ] [ -d ] [ -h ] [ -k | -m ] [ -N ] [ -t ] [ -V ] [ -x ] [ -y ] [ -z ]
    [ -j { ID | LABEL | PATH | UUID | ... } ]
    [ [ -T ] -g <group_name> ] [ -p [ <device> [,...] | ALL ] ]
    [ <device> [...] | ALL ]
    iostat -h: human-readable output
    iostat -m: show data in MB
    iostat -k: show data in KB
    iostat: print a single report
    iostat delay: print a report every delay seconds
    iostat delay count: print a report every delay seconds and stop after count reports

On CentOS, iostat is provided by the sysstat package: yum install sysstat

Example

    [root@centos ~]# iostat
    Linux 3.10.0-957.el7.x86_64 (centos)    20210503    _x86_64_    (1 CPU)
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.78    0.00    3.37    0.34    0.00   95.52
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              16.08       225.50       148.05     571180     374994
    scd0              0.01         0.41         0.00       1028          0
    dm-0             19.00       212.70       147.21     538751     372876
    dm-1              0.03         0.87         0.00       2204          0
    user: CPU time used by user processes (%)
    nice: CPU time used by user processes with a changed priority (%)
    system: CPU time used by kernel code (%)
    iowait: time the CPU was idle while waiting for I/O (%)  # above 20% points to an I/O bottleneck
    steal: time the virtual CPU waited while the hypervisor served another virtual machine (%), normally 0
    idle: time the CPU was idle (%)
    tps: transfers (I/O requests) per second issued to the device
    kB_read/s: data read per second (kB)
    kB_wrtn/s: data written per second (kB)
    kB_read: total data read (kB)
    kB_wrtn: total data written (kB)
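
The 20% iowait rule can be checked from a script by extracting the %iowait column. A minimal sketch, assuming the `avg-cpu:` header is followed directly by the value row as in the output above; `iowait_pct` is a hypothetical helper.

```shell
#!/bin/sh
# iowait_pct (hypothetical helper): read iostat output on stdin and print
# the %iowait value of every CPU report it contains.
# Assumes the column order above: %user %nice %system %iowait %steal %idle.
iowait_pct() {
    awk '/^avg-cpu:/ { if ((getline) > 0) print $4 }'
}
```

A possible invocation is `iostat -c 1 3 | iowait_pct`. The first report covers the time since boot, so later values are the ones that reflect the current load.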

*********************

netstat: network monitoring

netstat reports on network connections, ports, and related information. Usage:

    [root@centos ~]# netstat --help
    usage: netstat [-vWeenNcCF] [<Af>] -r         netstat {-V|--version|-h|--help}
           netstat [-vWnNcaeol] [<Socket> ...]
           netstat { [-vWeenNac] -I[<Iface>] | [-veenNac] -i | [-cnNe] -M | -s [-6tuw] } [delay]
            -r, --route              display routing table
            -I, --interfaces=<Iface> display interface table for <Iface>
            -i, --interfaces         display interface table
            -g, --groups             display multicast group memberships
            -s, --statistics         display networking statistics (like SNMP)
            -M, --masquerade         display masqueraded connections
            -v, --verbose            be verbose
            -W, --wide               don't truncate IP addresses
            -n, --numeric            don't resolve names
            --numeric-hosts          don't resolve host names
            --numeric-ports          don't resolve port names
            --numeric-users          don't resolve user names
            -N, --symbolic           resolve hardware names
            -e, --extend             display other/more information
            -p, --programs           display PID/Program name for sockets
            -o, --timers             display timers
            -c, --continuous         continuous listing
            -l, --listening          display listening server sockets
            -a, --all                display all sockets (default: connected)
            -F, --fib                display Forwarding Information Base (default)
            -C, --cache              display routing cache instead of FIB
            -Z, --context            display SELinux security context for sockets

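Common options

-a: list all sockets (TCP, UDP, UNIX)

-at: list all TCP sockets

-au: list all UDP sockets

-ax: list all UNIX sockets

-n: show numeric IP addresses instead of resolving them to host names

-s: show per-protocol statistics on packets sent and received

-l: list only listening sockets (state LISTEN or LISTENING)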

Examples

    [root@centos ~]# netstat -a|head -5
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State
    tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
    tcp        0      0 localhost:smtp          0.0.0.0:*               LISTEN
    tcp        0     36 centos:ssh              192.168.57.1:57227      ESTABLISHED
    [root@centos ~]# netstat -an|head -5
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
    tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
    tcp        0     36 192.168.57.120:22       192.168.57.1:57227      ESTABLISHED
    Proto: protocol of the connection: tcp, udp, or unix
    Recv-Q: data in the receive queue that the application has not read yet
    Send-Q: data that has been sent but not yet acknowledged by the peer
    # Recv-Q and Send-Q should be 0; brief non-zero values are acceptable, but values that stay non-zero mean data is piling up
    Local Address: local address and port of the connection
    Foreign Address: address and port of the remote end
    State: connection state
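
To find the connections where data is piling up, the queue columns can be filtered. A minimal sketch, assuming the column order shown above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); `queued_sockets` is a hypothetical helper.

```shell
#!/bin/sh
# queued_sockets (hypothetical helper): read `netstat -ant` output on stdin
# and print every TCP socket whose Recv-Q or Send-Q is non-zero.
queued_sockets() {
    awk '/^tcp/ && ($2 > 0 || $3 > 0) {
        print $4, $5, "recv-q=" $2, "send-q=" $3
    }'
}
```

A possible invocation is `netstat -ant | queued_sockets`. A single hit is not alarming; what matters is whether the same connection shows up repeatedly over time.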

Count the server's TCP connections by state

    [root@centos ~]# netstat -a|awk '/^tcp/ {++S[$NF]} END {for(a in S) print a " " S[a] }'
    LISTEN 4
    ESTABLISHED 2
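
On systems where netstat is not installed, the same count can be derived from `ss` (part of iproute2). This is a substitution, not the article's method, and it assumes the `ss -ant` layout in which the state is the first column after a one-line header; `tcp_state_counts` is a hypothetical helper.

```shell
#!/bin/sh
# tcp_state_counts (hypothetical helper): read `ss -ant` output on stdin and
# print the number of TCP sockets per state, sorted by state name.
tcp_state_counts() {
    awk 'NR > 1 { ++count[$1] } END { for (s in count) print s, count[s] }' | sort
}
```

A possible invocation is `ss -ant | tcp_state_counts`. ss abbreviates some state names, for example ESTAB instead of ESTABLISHED.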

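If there are too many TCP connections, check whether the application can be optimized, or consider deploying more servers to spread the load.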

If the site is under malicious attack, find the IPs that access it most frequently and block them

    [root@centos ~]# cat access.log
    172.17.1.2
    172.17.1.1
    172.18.1.1
    172.16.1.1
    172.16.1.1
    172.16.1.1
    172.17.1.2
    [root@centos ~]# awk '{print $1}' access.log|sort|uniq -c|sort -nr -k 1|more
          3 172.16.1.1
          2 172.17.1.2
          1 172.18.1.1
          1 172.17.1.1
    # 172.16.1.1 accesses the site most often; block this IP if necessary
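
The pipeline above can be extended with a minimum request count so that only the suspicious addresses remain. A minimal sketch, assuming the client IP is the first field of each log line as in the sample access.log; `frequent_ips` is a hypothetical helper, and the iptables line is one common way to block an address, shown as a comment only.

```shell
#!/bin/sh
# frequent_ips LOG MIN (hypothetical helper): print "<count> <ip>" for every
# client IP that appears at least MIN times in LOG, busiest first.
frequent_ips() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -nr |
        awk -v min="$2" '$1 >= min { print $1, $2 }'
}

# Blocking is a separate, deliberate step that needs root, for example:
#   iptables -I INPUT -s 172.16.1.1 -j DROP
```

A possible invocation is `frequent_ips access.log 100`. A sensible threshold depends on normal traffic, so it is worth confirming that a busy address is not a proxy or a monitoring system before blocking it.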
