Troubleshooting ideas and solutions for high load problems in Linux systems
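Load is a measure of how much work the computer is doing. On Linux it counts the tasks that are running, waiting to run, or in uninterruptible sleep (typically waiting for IO). Load Average is the average Load over a period of time (1 minute, 5 minutes, 15 minutes).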
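A load average is easier to judge relative to the number of CPUs. The following is a minimal sketch, assuming a Linux /proc filesystem; load_per_cpu is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# load_per_cpu (hypothetical helper): divide a load average by a CPU count.
# $1 = load average, $2 = number of CPUs
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On Linux, the first three fields of /proc/loadavg are the
# 1-, 5- and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen _rest < /proc/loadavg
    cpus=$(getconf _NPROCESSORS_ONLN)
    echo "load: $one $five $fifteen  cpus: $cpus  1-min load per cpu: $(load_per_cpu "$one" "$cpus")"
fi
```

A per-CPU value that stays well above 1 suggests that more tasks are queued than the CPUs can serve; this is a rule of thumb, not a hard limit.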
1. Load analysis:
Case 1: CPU high, Load high
- Use the top command to find the PID of the process with the highest CPU usage;
- Use top -Hp PID to find the TID of the thread with the highest CPU usage inside that process;
- Convert the TID of that thread to hexadecimal with printf "%x\n" TID;
- For Java programs, use jstack PID to print the thread stack information (you can contact the business team for help with troubleshooting and locating the problem);
- Search the jstack output for the hexadecimal TID (it appears as nid=0x...) to view the stack of that thread;
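The TID conversion and the stack lookup above can be sketched as two small functions. This assumes a JDK's jstack is on the PATH and that the thread header lines in its output carry the native thread ID as nid=0x...; tid_to_nid and hot_thread_stack are hypothetical helper names.

```shell
#!/bin/sh
# tid_to_nid: convert a decimal thread ID (from top -Hp PID) into the
# hexadecimal "nid=0x..." form that appears in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# hot_thread_stack (hypothetical helper): print the stack of one thread.
# $1 = PID of the Java process, $2 = decimal TID of the busy thread.
# The 30 lines of context after the match are an arbitrary choice.
hot_thread_stack() {
    jstack "$1" | grep -w -A 30 "$(tid_to_nid "$2")"
}

# Example (not run here): hot_thread_stack 29884 29901
```

Nothing is executed when the file is sourced; the functions only run when called with a real PID and TID.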
Case 2: CPU is low, Load is high
- Use the top command to view the percentage of CPU time spent waiting for IO, that is, %wa;
- Use iostat -d -x -m 1 10 to view disk IO (install it with yum install -y sysstat);
- Use sar -n DEV 1 10 to view network IO;
- Use the following command to find the programs that occupy IO, counting threads in the R (runnable) or D (uninterruptible sleep) state per command;
ps -e -L h o state,cmd | awk '{if($1=="R"||$1=="D"){print $0}}' | sort | uniq -c | sort -k 1nr
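The counting part of the command above can be wrapped in a function so that it can be reused and tried on saved output. A minimal sketch; count_rd_threads is a hypothetical helper name and it expects "state command" lines on standard input.

```shell
#!/bin/sh
# count_rd_threads (hypothetical helper): keep only lines whose first field
# is R or D, then count identical lines and list the most frequent first.
count_rd_threads() {
    awk '$1 == "R" || $1 == "D"' | sort | uniq -c | sort -k1,1nr
}

# Example (not run here): ps -e -L h o state,cmd | count_rd_threads
```

A command that shows up many times in state D is a likely candidate for the IO bottleneck.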
2. Analysis of high CPU and high Load
- Use vmstat to view the CPU load at the system level;
- Use top to view the CPU load at the process level;
2.1. Use vmstat to view the CPU load at the system level
vmstat shows the usage of CPU resources from the system dimension.
Format: vmstat -n 1. The -n option prints the header only once, and 1 means that the results are refreshed once a second.
[root@k8s-10 ~]# vmstat -n 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy id wa st
 1  1      0 2798000   2076 6375040    0    0    10    76    10    49  6  2 91  1  0
 0  0      0 2798232   2076 6375128    0    0     0   207  7965 12525  7  2 90  2  0
Description of the main data columns in the returned result:
- r: Number of runnable threads, that is, threads running or waiting for CPU time. Since each CPU core can only run one thread at a time, a larger number usually means a slower system.
- b: Number of blocked processes, that is, processes in uninterruptible sleep, usually waiting for IO.
- us: Percentage of CPU time spent in user space. For example, on a server that performs frequent encryption and decryption, us can be close to 100 while the run queue r reaches 80 (observed on a machine under stress test, where performance was poor).
- sy: Percentage of CPU time spent in the kernel. A high value means that much time is spent in system calls, for example because of frequent IO operations.
- wa: Percentage of CPU time spent waiting for IO. A high value indicates serious IO wait, which may be caused by a large number of random disk accesses or by a bottleneck in disk performance.
- id: Percentage of CPU time spent idle. If the value is consistently 0 and sy is twice as large as us, it usually means that the system is short of CPU resources.
Common problems and solutions:
- If r is often greater than 4 and id is often less than 40, the CPU is under heavy load.
- If si and so (swap in and swap out) are not equal to 0 for a long time, memory is insufficient.
- If bi and bo (disk blocks in and out) are often not equal to 0 and the queue in b is greater than 3, IO performance is poor.
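The first rule of thumb above (r greater than 4 and id less than 40) can be checked automatically. A minimal sketch, assuming the standard Linux vmstat column order shown earlier, where r is column 1 and id is column 15; flag_cpu_pressure is a hypothetical helper name.

```shell
#!/bin/sh
# flag_cpu_pressure (hypothetical helper): read vmstat output on standard
# input and print the samples where r > 4 and id < 40.
# Header lines are skipped because their first field is not a number.
flag_cpu_pressure() {
    awk '$1 ~ /^[0-9]+$/ && $1 > 4 && $15 < 40 {
        print "cpu pressure: r=" $1 " id=" $15
    }'
}

# Example (not run here): vmstat -n 1 10 | flag_cpu_pressure
```

The thresholds 4 and 40 are taken from the rule above and should be adjusted to the number of cores on the machine.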
2.2. Use top to view the CPU load at the process level
top shows the usage of CPU, memory and other resources from the process dimension.
[root@k8s-10 ~]# top -c
top - 19:53:49 up 2 days, 7:57, 3 users, load average: 0.76, 0.79, 0.58
Tasks: 282 total, 2 running, 280 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 1.4 sy, 0.0 ni, 95.0 id, 1.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12304204 total, 2800864 free, 3119064 used, 6384276 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 8164632 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29884 root      20   0 5346580 929332  14556 S   0.0  7.6   6:19.19 /opt/jdk1.8.0_144/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apach+
  875 root      20   0  729524 563424  38612 S   3.1  4.6  93:22.70 kube-apiserver --authorization-mode=Node,RBAC --service-node-port-range=80-60000 --advertise-address=10.68.7.162 --allow-privileged=true -+
 3870 nfsnobo+  20   0  910376 317248  22812 S   1.6  2.6  42:29.59 /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention=1d --web.enable-life+
The third line on the default interface will display the overall usage of the current CPU resources, and the resource usage of each process will be displayed below.
Press the uppercase letter P in the interface to sort the results in descending order of CPU usage, which brings the processes that occupy the most CPU to the top. Then use the system log and the logs of the program itself to investigate the corresponding process further and determine the cause of its high CPU usage.
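For scripts or saved output, the same descending sort can be done without the interactive interface. A minimal sketch, assuming input lines of the form "PID %CPU COMMAND" with no header; sort_by_cpu is a hypothetical helper name.

```shell
#!/bin/sh
# sort_by_cpu (hypothetical helper): sort "PID %CPU COMMAND" lines by the
# second column, highest first, and keep the first N lines.
# $1 = number of lines to keep (default 5)
sort_by_cpu() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Example with procps ps (not run here):
#   ps -eo pid,pcpu,comm --no-headers | sort_by_cpu 5
```

Note that ps reports CPU usage averaged over the lifetime of each process, so the ranking can differ from the momentary values that top displays.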
2.3. strace command analysis
https://oa.kedacom.com/confluence/pages/viewpage.action?pageId=77136289
3. Analysis of low CPU and high Load
Problem Description:
No business programs are running on the Linux system, yet top shows that the CPU is very idle while the load average is very high, as shown in the following figure:
Processing method:
- The load average reflects the length of the task queue. The higher the value, the more tasks are waiting to be executed. On Linux it also counts tasks in uninterruptible sleep, which is why it can be high while the CPU is idle.
- When this happens, it is usually due to processes stuck in the D state (zombie processes, shown as state Z, do not add to the load). You can use the command ps -axjf to check whether there is a D state process.
- The D state is uninterruptible sleep. Processes in this state cannot be killed and cannot exit by themselves. The problem can only be solved by restoring the resources they depend on or by restarting the system.
- Processes waiting for IO are in uninterruptible sleep, shown as state D. Listing them makes it easy to find which processes are stuck waiting:
ps -eo state,pid,cmd | grep "^D"; echo "----"
- Find the programs that occupy IO:
ps -e -L h o state,cmd | awk '{if($1=="R"||$1=="D"){print $0}}' | sort | uniq -c | sort -k 1nr
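To see what a D state process is waiting on, the kernel wait channel can be printed next to it. A minimal sketch, assuming procps ps and its wchan output column; d_state_only is a hypothetical helper name.

```shell
#!/bin/sh
# d_state_only (hypothetical helper): keep only the lines whose first field
# is exactly D (uninterruptible sleep).
d_state_only() {
    awk '$1 == "D"'
}

# Example (not run here); wchan names the kernel function the process
# is sleeping in:
#   ps -eo state,pid,wchan:32,cmd | d_state_only
```

The wait channel often points to the subsystem that is stuck, for example a disk or a network file system, which helps decide which resource needs to be restored.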
Forwarded to: https://www.cnblogs.com/liuyupen/p/13905967.html