Linux server troubleshooting
Linux server faults can be investigated from several angles: CPU, memory, disk, and network.
#
*********************
vmstat: overall performance monitoring
vmstat reports on processes, memory, swap, I/O, and CPU. Usage:
[root@centos ~]# vmstat -h
Usage:
vmstat [options] [delay [count]]
#vmstat: print one report
#vmstat delay: print a report every delay seconds, indefinitely
#vmstat delay count: print count reports, delay seconds apart
Options:
-a, --active active/inactive memory
-f, --forks number of forks since boot
-m, --slabs slabinfo
-n, --one-header do not redisplay header
-s, --stats event counter statistics
-d, --disk disk statistics
-D, --disk-sum summarize disk statistics
-p, --partition <dev> partition specific statistics
-S, --unit <char> define display unit
-w, --wide wide output
-t, --timestamp show timestamp
-h, --help display this help and exit
-V, --version output version information and exit
Output fields explained
[root@centos ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 1551100 2108 168820 0 0 112 10 86 275 1 3 95 0 0
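#procs
r: number of processes running or waiting for CPU time //if this stays above the number of CPUs, the CPU is a bottleneck
b: number of processes blocked in uninterruptible sleep, usually waiting for I/O
#memory
swpd: amount of swap space in use
free: idle memory
buff: memory used as buffers (block-device cache)
cache: memory used as page cache (file contents)
#swap
si: memory swapped in from disk per second
so: memory swapped out to disk per second //si and so should be 0; if they stay non-zero, the system is short of memory
#io
bi: blocks read from block devices per second
bo: blocks written to block devices per second //rule of thumb: if bi+bo exceeds 1000, wa tends to be high as well, which suggests an I/O bottleneck
#system
in (interrupts): interrupts per second
cs (context switches): context switches per second //the larger in and cs are, the more CPU time goes to sy
#cpu
us: CPU time spent in user processes //if us stays above 50%, optimize the processes or deploy more servers
sy: CPU time spent in the kernel //if us+sy stays above 80%, the CPU is short of capacity
id: CPU idle time
wa: CPU time spent waiting for I/O //above 20% indicates serious I/O waiting
st: CPU time stolen from this virtual machine by the hypervisor, normally 0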
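The thresholds above can be checked mechanically. The following is a minimal sketch, not a definitive monitor: the function name vmstat_check and the message wording are made up for illustration, and the limits are simply the rules of thumb listed above. It reads vmstat output on stdin, so it can be fed a live run or a saved log. Keep in mind that the first line vmstat prints is an average since boot, so later samples are more telling.

```shell
# Read vmstat output on stdin and print a warning for each sample line
# that crosses one of the rule-of-thumb thresholds listed above.
# $1: number of CPUs (defaults to the value reported by nproc).
# vmstat_check is a hypothetical helper name, not part of procps.
vmstat_check() {
    awk -v ncpu="${1:-$(nproc)}" '
        $1 ~ /^[0-9]+$/ {    # data lines start with a number, headers do not
            if ($1 > ncpu)      print "run queue " $1 " exceeds " ncpu " CPU(s)"
            if ($7 + $8 > 0)    print "swapping: si=" $7 " so=" $8
            if ($13 + $14 > 80) print "cpu busy: us+sy=" ($13 + $14) "%"
            if ($16 > 20)       print "io wait: wa=" $16 "%"
        }'
}
```

Typical use would be: vmstat 1 5 | vmstat_check. No output means no threshold was crossed.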
#
*********************
free: checking memory
free shows how much memory is used and how much is still available. Usage:
[root@centos ~]# free --help
Usage:
free [options]
Options:
-b, --bytes show output in bytes #memory in bytes
-k, --kilo show output in kilobytes #memory in KiB
-m, --mega show output in megabytes #memory in MiB
-g, --giga show output in gigabytes #memory in GiB
--tera show output in terabytes
--peta show output in petabytes
-h, --human show human-readable output #memory in human-readable units
--si use powers of 1000 not 1024
-l, --lohi show detailed low and high memory statistics #low and high memory
-t, --total show total for RAM + swap #total of Mem + Swap
-s N, --seconds N repeat printing every N seconds #print every N seconds
-c N, --count N repeat printing N times, then exit #print N times, then exit
-w, --wide wide output
--help display this help and exit
-V, --version output version information and exit
Examples
[root@centos ~]# free -h
total used free shared buff/cache available
Mem: 1.8G 138M 1.5G 9.5M 167M 1.5G
Swap: 6.0G 0B 6.0G
#show the total (RAM + swap)
[root@centos ~]# free -t
total used free shared buff/cache available
Mem: 1863224 141796 1549804 9740 171624 1538100
Swap: 6291452 0 6291452
Total: 8154676 141796 7841256
[root@centos ~]# free -th
total used free shared buff/cache available
Mem: 1.8G 138M 1.5G 9.5M 167M 1.5G
Swap: 6.0G 0B 6.0G
Total: 7.8G 138M 7.5G
#show low and high memory
[root@centos ~]# free -lh
total used free shared buff/cache available
Mem: 1.8G 138M 1.5G 9.5M 167M 1.5G
Low: 1.8G 305M 1.5G
High: 0B 0B 0B
Swap: 6.0G 0B 6.0G
[root@centos ~]# free -lht
total used free shared buff/cache available
Mem: 1.8G 138M 1.5G 9.5M 167M 1.5G
Low: 1.8G 305M 1.5G
High: 0B 0B 0B
Swap: 6.0G 0B 6.0G
Total: 7.8G 138M 7.5G
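When judging whether memory is short, the available column is usually the better figure than free, because most of buff/cache can be reclaimed when applications need it. A minimal sketch that turns it into a percentage follows; the function name mem_available_pct is hypothetical, and it assumes a version of free that prints the available column, as in the output above.

```shell
# Read `free` output (any single unit, e.g. the default KiB) on stdin and
# print the share of RAM that is still available, as a whole percentage.
# mem_available_pct is a hypothetical helper name.
mem_available_pct() {
    awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}
```

Typical use would be: free | mem_available_pct. Avoid feeding it free -h output, since the human-readable columns mix units.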
*********************
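top, uptime: checking the CPU
uptime: shows the average load over the last 1, 5, and 15 minutes (the average number of processes that are running, runnable, or blocked in uninterruptible I/O)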
[root@centos ~]# uptime -h
Usage:
uptime [options]
Options:
-p, --pretty show uptime in pretty format
-h, --help display this help and exit
-s, --since system up since
-V, --version output version information and exit
#output explained
[root@centos ~]# uptime
18:21:32 up 1:13, 1 user, load average: 0.00, 0.01, 0.05
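18:21:32: current system time
up 1:13: the system has been running for 1 hour 13 minutes
1 user: one user is logged in
load average: 0.00, 0.01, 0.05: the average load over the last 1, 5, and 15 minutes
Rule of thumb: divide the load average by the number of CPUs. A ratio that stays at or below 3 means the CPU is coping; a ratio that stays above 5 means CPU load is too high. A ratio above 1 already means that some processes are waiting for a CPU or for I/O.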
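The ratio of load average to CPU count can be computed directly from the uptime line. This is a small sketch under stated assumptions: the function name load_per_cpu is made up, it uses only the 1-minute figure, and it expects the C-locale number format with a decimal point.

```shell
# Read `uptime` output on stdin and print the 1-minute load average
# divided by the CPU count ($1, defaulting to the value from nproc).
# load_per_cpu is a hypothetical helper name.
load_per_cpu() {
    awk -v ncpu="${1:-$(nproc)}" -F'load average: ' '
        { split($2, load, ", "); printf "%.2f\n", load[1] / ncpu }'
}
```

Typical use would be: LC_ALL=C uptime | load_per_cpu. The result can then be compared with the thresholds above.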
top: a live view of system load, processes, CPU usage, and memory
[root@centos ~]# top
#system load, see uptime
top - 18:33:49 up 1:25, 1 user, load average: 0.00, 0.01, 0.05
#tasks: 100 processes in total, 2 running, 98 sleeping, 0 zombies
Tasks: 100 total, 2 running, 98 sleeping, 0 stopped, 0 zombie
#CPU usage
%Cpu(s): 0.0 us, 10.5 sy, 0.0 ni, 89.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
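us: user processes, 0.0%
sy: kernel (system), 10.5%
ni: user processes running with an adjusted nice value (priority), 0.0%
id: idle, 89.5%
wa: waiting for I/O, 0.0%
hi, si: servicing hardware and software interrupts
st: stolen from this virtual machine by the hypervisor
#memory and swap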
KiB Mem : 1863224 total, 1549640 free, 141960 used, 171624 buff/cache
KiB Swap: 6291452 total, 6291452 free, 0 used. 1537936 avail Mem
#per-process status
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7626 root 20 0 161884 2160 1532 R 5.0 0.1 0:00.03 top
1 root 20 0 128008 6564 4148 S 0.0 0.4 0:03.53 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.32 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 R 0.0 0.0 0:01.66 rcu_sched
10 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 lru-add-drain
11 root rt 0 0 0 0 S 0.0 0.0 0:00.13 watchdog/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
14 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns
15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 writeback
17 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kintegrityd
18 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 bioset
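top can also run non-interactively, which makes its summary usable in scripts: -b selects batch mode and -n sets the number of iterations. The sketch below extracts the busy share of the CPU (100 minus id) from the %Cpu(s) line. The function name cpu_busy_pct is hypothetical, and the parsing assumes the summary format shown above.

```shell
# Read `top -b -n 1` output on stdin and print the busy CPU percentage,
# computed as 100 minus the idle (id) value of each %Cpu summary line.
# cpu_busy_pct is a hypothetical helper name.
cpu_busy_pct() {
    awk '/^%Cpu/ && match($0, /[0-9.]+ id/) {
        # the match is "<number> id"; drop the 3 trailing characters " id"
        printf "%.1f\n", 100 - substr($0, RSTART, RLENGTH - 3)
    }'
}
```

Typical use would be: top -b -n 1 | cpu_busy_pct.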
*********************
df, du, iostat: disk monitoring
df: shows free disk space. Usage:
[root@centos ~]# df --help
Usage: df [OPTION]... [FILE]...
Show information about the file system on which each FILE resides,
or all file systems by default.
Mandatory arguments to long options are mandatory for short options too.
-a, --all include pseudo, duplicate, inaccessible file systems
-B, --block-size=SIZE scale sizes by SIZE before printing them; e.g.,
'-BM' prints sizes in units of 1,048,576 bytes;
see SIZE format below
--direct show statistics for a file instead of mount point
--total produce a grand total
-h, --human-readable print sizes in human readable format (e.g., 1K 234M 2G)
-H, --si likewise, but use powers of 1000 not 1024
-i, --inodes list inode information instead of block usage
-k like --block-size=1K
-l, --local limit listing to local file systems
--no-sync do not invoke sync before getting usage info (default)
--output[=FIELD_LIST] use the output format defined by FIELD_LIST,
or print all fields if FIELD_LIST is omitted.
-P, --portability use the POSIX output format
--sync invoke sync before getting usage info
-t, --type=TYPE limit listing to file systems of type TYPE
-T, --print-type print file system type
-x, --exclude-type=TYPE limit listing to file systems not of type TYPE
-v (ignored)
--help display this help and exit
--version output version information and exit
Example
[root@centos ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 72G 20G 53G 27% /
devtmpfs 898M 0 898M 0% /dev
tmpfs 910M 0 910M 0% /dev/shm
tmpfs 910M 9.5M 901M 2% /run
tmpfs 910M 0 910M 0% /sys/fs/cgroup
/dev/sda1 2.0G 146M 1.9G 8% /boot
tmpfs 182M 0 182M 0% /run/user/0
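To spot nearly full filesystems without reading the whole table, the Use% column can be filtered. The following is a sketch only: the function name df_over and the default limit of 80 percent are assumptions, not values from this article. It expects df -P output, because the POSIX format keeps each filesystem on a single line, and it does not handle mount points that contain spaces.

```shell
# Read `df -P` output on stdin and print every mount point whose usage
# is at or above the limit in $1 (a percentage; 80 is an assumed default).
# df_over is a hypothetical helper name.
df_over() {
    awk -v limit="${1:-80}" 'NR > 1 {
        use = $5; sub(/%/, "", use)          # "27%" becomes "27"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Typical use would be: df -P | df_over 90.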
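If disk space is running low, find the largest files; any that are no longer needed can be emptied or deleted.
du: finds the largest files or directories. Usage: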
[root@centos ~]# du --help
Usage: du [OPTION]... [FILE]...
or: du [OPTION]... --files0-from=F
Summarize disk usage of each FILE, recursively for directories.
Mandatory arguments to long options are mandatory for short options too.
-0, --null end each output line with 0 byte rather than newline
-a, --all write counts for all files, not just directories
--apparent-size print apparent sizes, rather than disk usage; although
the apparent size is usually smaller, it may be
larger due to holes in ('sparse') files, internal
fragmentation, indirect blocks, and the like
-B, --block-size=SIZE scale sizes by SIZE before printing them; e.g.,
'-BM' prints sizes in units of 1,048,576 bytes;
see SIZE format below
-b, --bytes equivalent to '--apparent-size --block-size=1'
-c, --total produce a grand total
-D, --dereference-args dereference only symlinks that are listed on the
command line
-d, --max-depth=N print the total for a directory (or file, with --all)
only if it is N or fewer levels below the command
line argument; --max-depth=0 is the same as
--summarize
--files0-from=F summarize disk usage of the
NUL-terminated file names specified in file F;
if F is -, then read names from standard input
-H equivalent to --dereference-args (-D)
-h, --human-readable print sizes in human readable format (e.g., 1K 234M 2G)
#human-readable: each value is shown in the unit that fits its size
--inodes list inode usage information instead of block usage
-k like --block-size=1K #sizes in KiB
-L, --dereference dereference all symbolic links
-l, --count-links count sizes many times if hard linked
-m like --block-size=1M #sizes in MiB
-P, --no-dereference don't follow any symbolic links (this is the default)
-S, --separate-dirs for directories do not include size of subdirectories
--si like -h, but use powers of 1000 not 1024
-s, --summarize display only a total for each argument
-t, --threshold=SIZE exclude entries smaller than SIZE if positive,
or entries greater than SIZE if negative
--time show time of the last modification of any file in the
directory, or any of its subdirectories
--time=WORD show time as WORD instead of modification time:
atime, access, use, ctime or status
--time-style=STYLE show times using STYLE, which can be:
full-iso, long-iso, iso, or +FORMAT;
FORMAT is interpreted like in 'date'
-X, --exclude-from=FILE exclude files that match any pattern in FILE
--exclude=PATTERN exclude files that match PATTERN
-x, --one-file-system skip directories on different file systems
--help display this help and exit
--version output version information and exit
Examples
#find the largest directories under a given directory: du -k --max-depth=1 / | sort -nr | head -5
[root@centos ~]# du -k --max-depth=1|sort -nr -k 1|head -5
96320 .
96192 ./.m2
48 ./.docker
12 ./.ssh
0 ./.pki
#find files larger than 1M under the current directory: find . -type f -size +1M
[root@centos ~]# find . -type f -size +1M
./.m2/repository/org/apache/lucene/lucene-core/3.6.0/lucene-core-3.6.0.jar
./.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
./.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar
./.m2/repository/com/amazonaws/aws-java-sdk-iam/1.11.128/aws-java-sdk-iam-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-ec2/1.11.128/aws-java-sdk-ec2-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-dynamodb/1.11.128/aws-java-sdk-dynamodb-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-ssm/1.11.128/aws-java-sdk-ssm-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-api-gateway/1.11.128/aws-java-sdk-api-gateway-1.11.128.jar
./.m2/repository/com/amazonaws/aws-java-sdk-models/1.11.128/aws-java-sdk-models-1.11.128.jar
./.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.7.4/jackson-databind-2.7.4.jar
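Empty a file (when its contents are not needed but a process is still writing to it): truncate -s 0 file. Deleting such a file would not free the space until the process closes it, and echo "" > file would leave a one-byte newline behind.
Delete a file (when its contents are not needed and nothing is writing to it): rm -f file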
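Emptying a file in place can be wrapped in a small helper. This is a minimal sketch; the function name empty_file is hypothetical. It uses a plain redirection with the no-op command, which truncates the file to zero bytes while keeping the same inode, so a process that already has the file open keeps writing to the same file.

```shell
# Truncate the file named in $1 to zero bytes without replacing it.
# empty_file is a hypothetical helper name.
empty_file() {
    : > "$1"
}
```

Typical use would be: empty_file app.log, where app.log stands for whatever log has grown too large. If the writer did not open the file in append mode, it may keep writing at its old offset and the file can end up sparse.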
iostat: shows how the disks are performing. Usage:
[root@centos ~]# iostat --help
Usage: iostat [ options ] [ <interval> [ <count> ] ]
Options are:
[ -c ] [ -d ] [ -h ] [ -k | -m ] [ -N ] [ -t ] [ -V ] [ -x ] [ -y ] [ -z ]
[ -j { ID | LABEL | PATH | UUID | ... } ]
[ [ -T ] -g <group_name> ] [ -p [ <device> [,...] | ALL ] ]
[ <device> [...] | ALL ]
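iostat -h: human-readable output
iostat -m: show values in MB
iostat -k: show values in kB
iostat: print one report (statistics since boot)
iostat delay: print a report every delay seconds
iostat delay count: print a report every delay seconds and stop after count reports
Install on CentOS: yum install sysstat
Example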
[root@centos ~]# iostat
Linux 3.10.0-957.el7.x86_64 (centos) 2021-05-03 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.78 0.00 3.37 0.34 0.00 95.52
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 16.08 225.50 148.05 571180 374994
scd0 0.01 0.41 0.00 1028 0
dm-0 19.00 212.70 147.21 538751 372876
dm-1 0.03 0.87 0.00 2204 0
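%user: CPU time spent in user processes
%nice: CPU time spent in user processes with an adjusted nice value (priority)
%system: CPU time spent in the kernel
%iowait: time the CPU was idle while waiting for outstanding disk I/O #above 20% suggests an I/O bottleneck
%steal: time the virtual CPU waited while the hypervisor served another virtual machine (normally 0)
%idle: time the CPU was idle
tps: transfers (I/O requests) per second issued to the device
kB_read/s: data read per second (kB)
kB_wrtn/s: data written per second (kB)
kB_read: total data read (kB)
kB_wrtn: total data written (kB)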
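The %iowait rule of thumb can be applied to each report automatically. The sketch below is illustrative only: the function name iowait_over is made up, the default limit of 20 is the threshold quoted above, and the parsing assumes the avg-cpu layout shown in the example, with decimal points as printed in the C locale.

```shell
# Read iostat output on stdin and print a line for every report whose
# %iowait is above the limit in $1 (a percentage, default 20).
# iowait_over is a hypothetical helper name.
iowait_over() {
    awk -v limit="${1:-20}" '
        /^avg-cpu:/ {
            getline                       # the values follow the header line
            if ($4 + 0 > limit + 0) print "iowait " $4 "% exceeds " limit "%"
        }'
}
```

Typical use would be: LC_ALL=C iostat 5 3 | iowait_over. Since the first report covers the time since boot, the later ones say more about the current state.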
*********************
netstat: network monitoring
netstat monitors network connections, ports, and related information. Usage:
[root@centos ~]# netstat --help
usage: netstat [-vWeenNcCF] [<Af>] -r netstat {-V|--version|-h|--help}
netstat [-vWnNcaeol] [<Socket> ...]
netstat { [-vWeenNac] -I[<Iface>] | [-veenNac] -i | [-cnNe] -M | -s [-6tuw] } [delay]
-r, --route display routing table
-I, --interfaces=<Iface> display interface table for <Iface>
-i, --interfaces display interface table
-g, --groups display multicast group memberships
-s, --statistics display networking statistics (like SNMP)
-M, --masquerade display masqueraded connections
-v, --verbose be verbose
-W, --wide don't truncate IP addresses
-n, --numeric don't resolve names
--numeric-hosts don't resolve host names
--numeric-ports don't resolve port names
--numeric-users don't resolve user names
-N, --symbolic resolve hardware names
-e, --extend display other/more information
-p, --programs display PID/Program name for sockets
-o, --timers display timers
-c, --continuous continuous listing
-l, --listening display listening server sockets
-a, --all display all sockets (default: connected)
-F, --fib display Forwarding Information Base (default)
-C, --cache display routing cache instead of FIB
-Z, --context display SELinux security context for sockets
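Common options
-a: list all sockets, listening and connected (tcp, udp, unix)
-at: list all tcp sockets
-au: list all udp sockets
-ax: list all unix sockets
-n: show numeric addresses and ports instead of resolving them to names
-s: show per-protocol packet statistics
-l: list only listening sockets
Examples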
[root@centos ~]# netstat -a|head -5
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:ssh 0.0.0.0:* LISTEN
tcp 0 0 localhost:smtp 0.0.0.0:* LISTEN
tcp 0 36 centos:ssh 192.168.57.1:57227 ESTABLISHED
[root@centos ~]# netstat -an|head -5
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN
tcp 0 36 192.168.57.120:22 192.168.57.1:57227 ESTABLISHED
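Proto: protocol of the socket: tcp, udp, or unix
Recv-Q: bytes received but not yet read by the local application
Send-Q: bytes sent but not yet acknowledged by the remote host
#Recv-Q and Send-Q should be 0; brief non-zero values are acceptable, but values that stay non-zero mean data is piling up
Local Address: local address and port of the connection
Foreign Address: address and port of the remote end
State: connection state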
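To see only the sockets whose queues are not empty, the Recv-Q and Send-Q columns can be filtered. This is a minimal sketch; the function name backlogged_sockets is hypothetical. A single snapshot proves little, so it makes sense to run it a few times and look for sockets that keep showing up.

```shell
# Read `netstat -an` output on stdin and print the tcp/udp lines whose
# Recv-Q (column 2) or Send-Q (column 3) is greater than zero.
# backlogged_sockets is a hypothetical helper name.
backlogged_sockets() {
    awk '/^(tcp|udp)/ && ($2 + 0 > 0 || $3 + 0 > 0)'
}
```

Typical use would be: netstat -an | backlogged_sockets.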
Count the server's TCP connections by state
[root@centos ~]# netstat -a|awk '/^tcp/ {++S[$NF]} END {for(a in S) print a " " S[a] }'
LISTEN 4
ESTABLISHED 2
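If there are too many TCP connections, check whether the application can be optimized, or deploy more servers to spread the load.
If the site is under attack, find the IPs that access it most often and block them.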
[root@centos ~]# cat access.log
172.17.1.2
172.17.1.1
172.18.1.1
172.16.1.1
172.16.1.1
172.16.1.1
172.17.1.2
[root@centos ~]# awk '{print $1}' access.log|sort|uniq -c|sort -nr -k 1|more
3 172.16.1.1
2 172.17.1.2
1 172.18.1.1
1 172.17.1.1
#172.16.1.1 makes the most requests; block this IP if necessary
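The counting step can be narrowed to the addresses that exceed a chosen number of requests. The sketch below is hedged accordingly: the function name frequent_ips and the default limit of 100 are assumptions, and it expects the client IP in the first field of each log line, as in the access.log above. The iptables line in the comment is one possible way to block an address, not something this article prescribes.

```shell
# Read an access log on stdin and print "count ip" for every client IP
# (first field) seen at least $1 times (100 is an assumed default),
# most frequent first.
# frequent_ips is a hypothetical helper name.
frequent_ips() {
    awk -v min="${1:-100}" '
        { hits[$1]++ }
        END { for (ip in hits) if (hits[ip] >= min) print hits[ip], ip }
    ' | sort -nr
}
# One possible follow-up for an offending address:
#   iptables -I INPUT -s 172.16.1.1 -j DROP
```

Typical use would be: frequent_ips 3 < access.log. Review the list before blocking anything, since proxies and monitoring hosts can also appear as heavy clients.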