Linux server troubleshooting

Linux server faults can be investigated from several angles: CPU, memory, disk and network.

*********************

vmstat: overall performance monitoring

vmstat reports on processes, memory, swap, I/O and CPU activity. Usage:

[root@centos ~]# vmstat -h
Usage:
 vmstat [options] [delay [count]]
#vmstat: print a single report (averages since boot)
#vmstat delay: print a report every delay seconds
#vmstat delay count: print count reports, one every delay seconds
Options:
 -a, --active           active/inactive memory
 -f, --forks            number of forks since boot
 -m, --slabs            slabinfo
 -n, --one-header       do not redisplay header
 -s, --stats            event counter statistics
 -d, --disk             disk statistics
 -D, --disk-sum         summarize disk statistics
 -p, --partition <dev>  partition specific statistics
 -S, --unit <char>      define display unit
 -w, --wide             wide output
 -t, --timestamp        show timestamp
 -h, --help     display this help and exit
 -V, --version  output version information and exit

Output fields

[root@centos ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 1551100   2108 168820    0    0   112    10   86  275  1  3 95  0  0
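#procs
r: processes running or waiting for a CPU         //persistently above the CPU count means the CPU is a bottleneck
b: processes blocked in uninterruptible sleep, usually waiting for I/O
#memory
swpd: swap space in use
free: idle memory
buff: memory used as buffers (block-device buffers)
cache: memory used as page cache
#swap
si: memory swapped in from disk per second
so: memory swapped out to disk per second         //si and so should stay at 0; sustained nonzero values mean memory is short
#io
bi: blocks read from block devices per second
bo: blocks written to block devices per second    //rule of thumb: bi+bo above 1000 usually comes with a high wa and points to an I/O bottleneck
#system
in (interrupts): interrupts per second
cs (context switches): context switches per second   //the larger in and cs are, the more CPU time goes to sy
#cpu (percentages of total CPU time)
us: time spent in user-space code                 //persistently above 50%: optimize the processes or spread them over more servers
sy: time spent in kernel code                     //us+sy persistently above 80%: the CPU is short
id: idle time
wa: time spent idle waiting for I/O               //above 20%: I/O wait is severe
st: time stolen from this virtual machine by the hypervisor; usually 0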
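
The thresholds above can be checked mechanically. The following is a minimal sketch, not a monitoring tool: the function name vmstat_flags is made up for this example, the column positions assume the default 17-column vmstat layout shown above, and the limits are the rules of thumb from this section rather than hard facts.

```shell
# vmstat_flags NCPU: read vmstat output on stdin and print one line for
# every sample that crosses a rule-of-thumb threshold from this section.
# Assumes the default 17-column layout (r b swpd free buff cache si so
# bi bo in cs us sy id wa st). The function name is hypothetical.
vmstat_flags() {
  LC_ALL=C awk -v ncpu="${1:-1}" '
    # Data rows start with a number; the two header rows do not.
    $1 ~ /^[0-9]+$/ && NF >= 17 {
      n++
      if ($1 > ncpu)        printf "sample %d: run queue %d > %d CPUs\n", n, $1, ncpu
      if ($7 > 0 || $8 > 0) printf "sample %d: swapping (si=%d so=%d)\n", n, $7, $8
      if ($13 + $14 > 80)   printf "sample %d: us+sy = %d%%\n", n, $13 + $14
      if ($16 > 20)         printf "sample %d: wa = %d%%\n", n, $16
    }'
}
```

Usage: vmstat 1 5 | vmstat_flags "$(nproc)". The first sample reports averages since boot, so a warning that appears only for sample 1 says little about the current state.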

*********************

free: view memory

free shows how much memory is in use and how much is still available. Usage:

[root@centos ~]# free --help
Usage:
 free [options]
Options:
 -b, --bytes         show output in bytes         #display memory in bytes
 -k, --kilo          show output in kilobytes     #display memory in KB
 -m, --mega          show output in megabytes     #display memory in MB
 -g, --giga          show output in gigabytes     #display memory in GB
     --tera          show output in terabytes
     --peta          show output in petabytes
 -h, --human         show human-readable output   #pick a readable unit automatically
     --si            use powers of 1000 not 1024
 -l, --lohi          show detailed low and high memory statistics  #show low and high memory
 -t, --total         show total for RAM + swap                     #show the total: Mem + Swap
 -s N, --seconds N   repeat printing every N seconds     #print every N seconds
 -c N, --count N     repeat printing N times, then exit  #exit after printing N times
 -w, --wide          wide output
     --help     display this help and exit
 -V, --version  output version information and exit

Example

[root@centos ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
Swap:          6.0G          0B        6.0G
#show the total
[root@centos ~]# free -t
              total        used        free      shared  buff/cache   available
Mem:        1863224      141796     1549804        9740      171624     1538100
Swap:       6291452           0     6291452
Total:      8154676      141796     7841256
[root@centos ~]# free -th
              total        used        free      shared  buff/cache   available
Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
Swap:          6.0G          0B        6.0G
Total:         7.8G        138M        7.5G
#show low and high memory
[root@centos ~]# free -lh
              total        used        free      shared  buff/cache   available
Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
Low:           1.8G        305M        1.5G
High:            0B          0B          0B
Swap:          6.0G          0B        6.0G
[root@centos ~]# free -lht
              total        used        free      shared  buff/cache   available
Mem:           1.8G        138M        1.5G        9.5M        167M        1.5G
Low:           1.8G        305M        1.5G
High:            0B          0B          0B
Swap:          6.0G          0B        6.0G
Total:         7.8G        138M        7.5G
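
For scripted checks, the available column is usually more telling than free, because it includes cache that the kernel can reclaim. A minimal sketch follows, assuming a free version that prints the available column (as in the output above) and plain numeric output (no -h); the function name mem_available_pct is made up for this example.

```shell
# mem_available_pct: read `free` output on stdin and print available
# memory as a whole-number percentage of total memory (truncated).
# Assumes the "available" column is the 7th field of the Mem: line.
# The function name is hypothetical.
mem_available_pct() {
  LC_ALL=C awk '$1 == "Mem:" && NF >= 7 { printf "%d\n", $7 * 100 / $2 }'
}
```

Usage: free | mem_available_pct. With the free -t figures shown above (1538100 available out of 1863224 total) this prints 82.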

*********************

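top, uptime: check the CPU

uptime: shows the load average over the last 1, 5 and 15 minutes (the average number of processes that are runnable or in uninterruptible sleep)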

[root@centos ~]# uptime -h
Usage:
 uptime [options]
Options:
 -p, --pretty   show uptime in pretty format
 -h, --help     display this help and exit
 -s, --since    system up since
 -V, --version  output version information and exit
#output fields
[root@centos ~]# uptime
 18:21:32 up  1:13,  1 user,  load average: 0.00, 0.01, 0.05
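18:21:32: current system time
1:13: the system has been up for 1 hour 13 minutes
1 user: number of logged-in users
load average: 0.00, 0.01, 0.05: load averages over the last 1, 5 and 15 minutes

Rule of thumb: divide the load average by the number of CPUs. Up to 3, the CPU is coping; persistently above 5, the CPU is overloaded. A ratio above 1 already means that some tasks are waiting (for a CPU or, on Linux, for I/O).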
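
The ratio can be computed directly. A minimal sketch, assuming a Linux system where /proc/loadavg and nproc are available; the function name load_per_cpu is made up for this example.

```shell
# load_per_cpu LOAD NCPU: print the load average divided by the number
# of CPUs, with two decimals. The function name is hypothetical.
load_per_cpu() {
  LC_ALL=C awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}
```

Usage: load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". The first field of /proc/loadavg is the 1-minute load average; fields 2 and 3 hold the 5-minute and 15-minute values.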

top: live view of system load, processes, CPU usage and memory usage

[root@centos ~]# top
#system load, see uptime
top - 18:33:49 up  1:25,  1 user,  load average: 0.00, 0.01, 0.05
#processes: 100 in total, 2 running, 98 sleeping, 0 stopped, 0 zombie
Tasks: 100 total,   2 running,  98 sleeping,   0 stopped,   0 zombie
#CPU usage
%Cpu(s):  0.0 us, 10.5 sy,  0.0 ni, 89.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
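us: user-space processes, 0.0%
sy: kernel, 10.5%
ni: user-space processes with a changed priority (niced), 0.0%
id: idle, 89.5%
wa: waiting for I/O, 0.0%
hi, si, st: hardware interrupts, software interrupts and time stolen by the hypervisor, 0.0% each
#memory and swap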
KiB Mem :  1863224 total,  1549640 free,   141960 used,   171624 buff/cache
KiB Swap:  6291452 total,  6291452 free,        0 used.  1537936 avail Mem 
#per-process status
   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                          
  7626 root      20   0  161884   2160   1532 R  5.0  0.1   0:00.03 top                                                                              
     1 root      20   0  128008   6564   4148 S  0.0  0.4   0:03.53 systemd                                                                          
     2 root      20   0       0      0      0 S  0.0  0.0   0:00.01 kthreadd                                                                         
     3 root      20   0       0      0      0 S  0.0  0.0   0:00.32 ksoftirqd/0                                                                      
     5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H                                                                     
     7 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0                                                                      
     8 root      20   0       0      0      0 S  0.0  0.0   0:00.00 rcu_bh                                                                           
     9 root      20   0       0      0      0 R  0.0  0.0   0:01.66 rcu_sched                                                                        
    10 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 lru-add-drain                                                                    
    11 root      rt   0       0      0      0 S  0.0  0.0   0:00.13 watchdog/0                                                                       
    13 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kdevtmpfs                                                                        
    14 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 netns                                                                            
    15 root      20   0       0      0      0 S  0.0  0.0   0:00.00 khungtaskd                                                                       
    16 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 writeback                                                                        
    17 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kintegrityd                                                                      
    18 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 bioset
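
top can also run non-interactively with top -b -n 1, which makes its summary usable in scripts. The sketch below pulls the idle percentage out of the %Cpu(s) line. It assumes the summary format shown above (other top versions print a different layout), and the function name top_cpu_idle is made up for this example.

```shell
# top_cpu_idle: read `top -b -n 1` output on stdin and print the idle
# CPU percentage from the "%Cpu(s):" summary line.
# Assumes the "NN.N id" layout shown above. The function name is hypothetical.
top_cpu_idle() {
  LC_ALL=C awk '/^%Cpu\(s\):/ {
    # Match the number directly in front of " id" and strip the suffix.
    if (match($0, /[0-9.]+ id/)) { print substr($0, RSTART, RLENGTH - 3); exit }
  }'
}
```

Usage: top -b -n 1 | top_cpu_idle. To see which processes use the most CPU without the interactive screen, ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head is a common alternative.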

*********************

df, du, iostat: disk monitoring

df: shows free disk space. Usage:

[root@centos ~]# df --help
Usage: df [OPTION]... [FILE]...
Show information about the file system on which each FILE resides,
or all file systems by default.
Mandatory arguments to long options are mandatory for short options too.
  -a, --all             include pseudo, duplicate, inaccessible file systems
  -B, --block-size=SIZE  scale sizes by SIZE before printing them; e.g.,
                           '-BM' prints sizes in units of 1,048,576 bytes;
                           see SIZE format below
      --direct          show statistics for a file instead of mount point
      --total           produce a grand total
  -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
  -H, --si              likewise, but use powers of 1000 not 1024
  -i, --inodes          list inode information instead of block usage
  -k                    like --block-size=1K
  -l, --local           limit listing to local file systems
      --no-sync         do not invoke sync before getting usage info (default)
      --output[=FIELD_LIST]  use the output format defined by FIELD_LIST,
                               or print all fields if FIELD_LIST is omitted.
  -P, --portability     use the POSIX output format
      --sync            invoke sync before getting usage info
  -t, --type=TYPE       limit listing to file systems of type TYPE
  -T, --print-type      print file system type
  -x, --exclude-type=TYPE   limit listing to file systems not of type TYPE
  -v                    (ignored)
      --help     display this help and exit
      --version  output version information and exit

Example

[root@centos ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   72G   20G   53G   27% /
devtmpfs                 898M     0  898M    0% /dev
tmpfs                    910M     0  910M    0% /dev/shm
tmpfs                    910M  9.5M  901M    2% /run
tmpfs                    910M     0  910M    0% /sys/fs/cgroup
/dev/sda1                2.0G  146M  1.9G    8% /boot
tmpfs                    182M     0  182M    0% /run/user/0
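
The Use% column can be checked automatically. A minimal sketch, assuming the six-column layout of df -h or df -P and mount points without spaces; the function name df_over and the default limit of 80 are made up for this example.

```shell
# df_over LIMIT: read df output on stdin and print the mount point and
# Use% of every filesystem at or above LIMIT percent (default 80).
# Assumes six columns and no spaces in mount points.
# The function name is hypothetical.
df_over() {
  LC_ALL=C awk -v limit="${1:-80}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)                 # "27%" -> "27"
    if (pct + 0 >= limit) print $6, $5
  }'
}
```

Usage: df -P | df_over 90. The -P option keeps each filesystem on one line even when the device name is long, which makes the output safer to parse.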

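If a filesystem is running out of space, find the largest files; files that are no longer needed can be truncated or deleted.

du: finds the largest files or directories. Usage: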

[root@centos ~]# du --help
Usage: du [OPTION]... [FILE]...
  or:  du [OPTION]... --files0-from=F
Summarize disk usage of each FILE, recursively for directories.
Mandatory arguments to long options are mandatory for short options too.
  -0, --null            end each output line with 0 byte rather than newline
  -a, --all             write counts for all files, not just directories
      --apparent-size   print apparent sizes, rather than disk usage; although
                          the apparent size is usually smaller, it may be
                          larger due to holes in ('sparse') files, internal
                          fragmentation, indirect blocks, and the like
  -B, --block-size=SIZE  scale sizes by SIZE before printing them; e.g.,
                           '-BM' prints sizes in units of 1,048,576 bytes;
                           see SIZE format below
  -b, --bytes           equivalent to '--apparent-size --block-size=1'
  -c, --total           produce a grand total
  -D, --dereference-args  dereference only symlinks that are listed on the
                          command line
  -d, --max-depth=N     print the total for a directory (or file, with --all)
                          only if it is N or fewer levels below the command
                          line argument;  --max-depth=0 is the same as
                          --summarize
      --files0-from=F   summarize disk usage of the
                          NUL-terminated file names specified in file F;
                          if F is -, then read names from standard input
  -H                    equivalent to --dereference-args (-D)
  -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
                                                #human-readable: each value is shown in a unit that suits its size
      --inodes          list inode usage information instead of block usage
  -k                    like --block-size=1K    #show sizes in KB
  -L, --dereference     dereference all symbolic links
  -l, --count-links     count sizes many times if hard linked
  -m                    like --block-size=1M    #show sizes in MB
  -P, --no-dereference  don't follow any symbolic links (this is the default)
  -S, --separate-dirs   for directories do not include size of subdirectories
      --si              like -h, but use powers of 1000 not 1024
  -s, --summarize       display only a total for each argument
  -t, --threshold=SIZE  exclude entries smaller than SIZE if positive,
                          or entries greater than SIZE if negative
      --time            show time of the last modification of any file in the
                          directory, or any of its subdirectories
      --time=WORD       show time as WORD instead of modification time:
                          atime, access, use, ctime or status
      --time-style=STYLE  show times using STYLE, which can be:
                            full-iso, long-iso, iso, or +FORMAT;
                            FORMAT is interpreted like in 'date'
  -X, --exclude-from=FILE  exclude files that match any pattern in FILE
      --exclude=PATTERN    exclude files that match PATTERN
  -x, --one-file-system    skip directories on different file systems
      --help     display this help and exit
      --version  output version information and exit

Example

#find the largest directories under a given directory: du -k --max-depth=1 / | sort -nr | head -5
[root@centos ~]# du -k --max-depth=1|sort -nr -k 1|head -5
96320    .
96192    ./.m2
48    ./.docker
12    ./.ssh
0    ./.pki
#find files larger than 1M under the current directory: find . -type f -size +1M
[root@centos ~]# find . -type f -size +1M
./.m2/repository/org/apache/lucene/lucene-core/3.6.0/lucene-core-3.6.0.jar
./.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
./.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar
./.m2/repository/com/amazonaws/aws-java-sdk-iam/1.11.128/aws-java-sdk-iam-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-ec2/1.11.128/aws-java-sdk-ec2-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-dynamodb/1.11.128/aws-java-sdk-dynamodb-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-ssm/1.11.128/aws-java-sdk-ssm-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-api-gateway/1.11.128/aws-java-sdk-api-gateway-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-models/1.11.128/aws-java-sdk-models-1.11.128.jar
./.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.4/jackson-databind-2.7.4.jar
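Truncate a file (its content is not needed, but a process is still writing to it): truncate -s 0 file. echo "" > file also works but leaves a single newline in the file. Deleting such a file would not free the space until the writing process closes it.
Delete a file (its content is not needed and nothing writes to it any more): rm -f file. The -r flag is only needed for directories.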
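
The find command above lists large files but not their sizes. The sketch below prints sizes and sorts by them. It assumes GNU find (the -printf action is a GNU extension), and the function name largest_files is made up for this example.

```shell
# largest_files DIR N: print the N largest regular files under DIR as
# "size-in-bytes<TAB>path", largest first, without crossing into other
# filesystems (-xdev). Requires GNU find for -printf.
# The function name is hypothetical.
largest_files() {
  find "${1:-.}" -xdev -type f -printf '%s\t%p\n' | sort -nr | head -n "${2:-10}"
}
```

Usage: largest_files /var/log 5. Check that a file is really unused before truncating or deleting it.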

iostat: shows disk activity. Usage:

[root@centos ~]# iostat --help
Usage: iostat [ options ] [ <interval> [ <count> ] ]
Options are:
[ -c ] [ -d ] [ -h ] [ -k | -m ] [ -N ] [ -t ] [ -V ] [ -x ] [ -y ] [ -z ]
[ -j { ID | LABEL | PATH | UUID | ... } ]
[ [ -T ] -g <group_name> ] [ -p [ <device> [,...] | ALL ] ]
[ <device> [...] | ALL ]
iostat -h: human-readable output
iostat -m: show data in MB
iostat -k: show data in KB
iostat: print a single report (statistics since boot)
iostat delay: print a report every delay seconds
iostat delay count: print a report every delay seconds and stop after count reports

Install on CentOS: yum install sysstat

Example

[root@centos ~]# iostat
Linux 3.10.0-957.el7.x86_64 (centos)     2021-05-03     _x86_64_    (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.78    0.00    3.37    0.34    0.00   95.52
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              16.08       225.50       148.05     571180     374994
scd0              0.01         0.41         0.00       1028          0
dm-0             19.00       212.70       147.21     538751     372876
dm-1              0.03         0.87         0.00       2204          0
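user: CPU time spent in user-space processes
nice: CPU time spent in user-space processes with a changed priority (niced)
system: CPU time spent in the kernel
iowait: time the CPU was idle while waiting for outstanding disk I/O    #above 20% points to an I/O bottleneck
steal: time the virtual CPU waited while the hypervisor served another virtual processor (usually 0)
idle: time the CPU was idle
tps: transfers (I/O requests) per second issued to the device
kB_read/s: data read per second (kB)
kB_wrtn/s: data written per second (kB)
kB_read: total data read (kB)
kB_wrtn: total data written (kB)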
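
To watch %iowait over time, the value can be extracted from each report. A minimal sketch, assuming the avg-cpu layout shown above, where %iowait is the fourth value; the function name iostat_iowait is made up for this example.

```shell
# iostat_iowait: read iostat output on stdin and print the %iowait value
# of every report (the 4th number on the line after "avg-cpu:").
# The function name is hypothetical.
iostat_iowait() {
  LC_ALL=C awk '/^avg-cpu:/ { if ((getline) > 0) print $4 }'
}
```

Usage: LC_ALL=C iostat -c 1 5 | iostat_iowait. Setting LC_ALL=C keeps the decimal point a dot, and the first value covers the time since boot, so the later samples are the ones to compare against the 20% rule of thumb.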

*********************

netstat: network monitoring

netstat reports on network connections, ports and related statistics. Usage:

[root@centos ~]# netstat --help
usage: netstat [-vWeenNcCF] [<Af>] -r         netstat {-V|--version|-h|--help}
       netstat [-vWnNcaeol] [<Socket> ...]
       netstat { [-vWeenNac] -I[<Iface>] | [-veenNac] -i | [-cnNe] -M | -s [-6tuw] } [delay]
        -r, --route              display routing table
        -I, --interfaces=<Iface> display interface table for <Iface>
        -i, --interfaces         display interface table
        -g, --groups             display multicast group memberships
        -s, --statistics         display networking statistics (like SNMP)
        -M, --masquerade         display masqueraded connections
        -v, --verbose            be verbose
        -W, --wide               don't truncate IP addresses
        -n, --numeric            don't resolve names
        --numeric-hosts          don't resolve host names
        --numeric-ports          don't resolve port names
        --numeric-users          don't resolve user names
        -N, --symbolic           resolve hardware names
        -e, --extend             display other/more information
        -p, --programs           display PID/Program name for sockets
        -o, --timers             display timers
        -c, --continuous         continuous listing
        -l, --listening          display listening server sockets
        -a, --all                display all sockets (default: connected)
        -F, --fib                display Forwarding Information Base (default)
        -C, --cache              display routing cache instead of FIB
        -Z, --context            display SELinux security context for sockets

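Common options

-a: list all sockets (tcp, udp, unix), both listening and connected

-at: list all tcp sockets

-au: list all udp sockets

-ax: list all unix sockets

-n: show numeric addresses and ports instead of resolving names

-s: per-protocol statistics (packets sent and received, errors)

-l: list only listening sockets

Example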

[root@centos ~]# netstat -a|head -5
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN     
tcp        0      0 localhost:smtp          0.0.0.0:*               LISTEN     
tcp        0     36 centos:ssh              192.168.57.1:57227      ESTABLISHED
[root@centos ~]# netstat -an|head -5
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN     
tcp        0     36 192.168.57.120:22       192.168.57.1:57227      ESTABLISHED
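Proto: protocol of the socket (tcp, udp, unix)
Recv-Q: bytes received but not yet read by the local program
Send-Q: bytes sent but not yet acknowledged by the remote host
#Recv-Q and Send-Q should be 0; brief nonzero values are fine, but persistently nonzero values mean data is piling up
Local Address: local address and port of the connection
Foreign Address: address and port of the remote end
State: connection state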
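
Connections with queued data can be filtered out of the listing. A minimal sketch, assuming the netstat -ant column layout shown above. LISTEN sockets are skipped because their queue columns can carry a different meaning (the accept backlog) on some kernels. The function name busy_queues is made up for this example.

```shell
# busy_queues: read `netstat -ant` output on stdin and print every
# non-listening TCP connection whose Recv-Q or Send-Q is nonzero.
# The function name is hypothetical.
busy_queues() {
  LC_ALL=C awk '$1 ~ /^tcp/ && $6 != "LISTEN" && ($2 > 0 || $3 > 0) {
    print $4, $5, "Recv-Q=" $2, "Send-Q=" $3
  }'
}
```

Usage: netstat -ant | busy_queues. Run it a few times in a row: a connection that shows up once is normal, one that shows up every time is where data is piling up.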

Count the server's TCP connections by state

[root@centos ~]# netstat -a|awk '/^tcp/ {++S[$NF]} END {for(a in S) print a " " S[a] }'
LISTEN 4
ESTABLISHED 2

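If there are too many TCP connections, check whether the application can be optimized, or deploy more servers to spread the load

If the site is under attack, find the IP addresses that hit it most often and block them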

[root@centos ~]# cat access.log
172.17.1.2
172.17.1.1
172.18.1.1
172.16.1.1
172.16.1.1
172.16.1.1
172.17.1.2
[root@centos ~]# awk '{print $1}' access.log|sort|uniq -c|sort -nr -k 1|more
      3 172.16.1.1
      2 172.17.1.2
      1 172.18.1.1
      1 172.17.1.1
#172.16.1.1 is the most frequent visitor; block this IP if necessary
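
The counting and the block step can be combined. The sketch below only prints iptables commands for review and does not run them. It assumes the client IP is the first field of each log line, as in the access.log above, and that iptables is the firewall in use; the function name ban_candidates and the default threshold of 100 are made up for this example.

```shell
# ban_candidates LOGFILE MIN: print one iptables DROP command for every
# IP (first field of each log line) that appears at least MIN times.
# Nothing is executed; the commands are printed for manual review.
# The function name and the default threshold are hypothetical.
ban_candidates() {
  LC_ALL=C awk -v min="${2:-100}" '
    { hits[$1]++ }
    END {
      for (ip in hits)
        if (hits[ip] >= min) print "iptables -I INPUT -s " ip " -j DROP"
    }' "$1" | sort
}
```

Usage: ban_candidates access.log 3. With the sample log above this prints a single command for 172.16.1.1. Check that an address does not belong to a proxy, a load balancer or a monitoring system before blocking it.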