How do you troubleshoot a multithreaded deadlock?

When you develop multithreaded programs, you will sooner or later run into deadlocks, so knowing how to troubleshoot them is important.

  1. Step 1: capture the stacks of the running program with pstack. Capture several times and compare the results; threads whose stacks never change and that sit in a lock wait point to the deadlock. Then analyze the code. If it is relatively simple, you can work out which of the four necessary conditions (mutual exclusion, hold and wait, no preemption, circular wait) is involved and break it. If you cannot locate the problem and the code looks fine, you need to debug.
  2. Step 2: while the program is deadlocked, attach gdb to the process, list the threads with info thread, switch between them with thread N, and find the mutex each one is waiting for. Check thread by thread. Most problems can be solved at this step.
  3. Step 3: if the first two steps do not locate the problem, run the program under gdb, set breakpoints at the lock and unlock calls, and attach commands to them. When the deadlock occurs, the output shows what each thread did leading up to it.

Here is a specific example:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 10000
int num = 0;
pthread_mutex_t g_mutex;

void* thread_func(void* arg) {
    for (int i = 0; i < LEN; ++i) {
        pthread_mutex_lock(&g_mutex);
        num += 1;
        if (num == 9999) return NULL;  // Intentional bug for this simple example: returns without releasing the lock
        pthread_mutex_unlock(&g_mutex);
    }

    return NULL;
}

int main() {
    pthread_mutex_init(&g_mutex, NULL);

    pthread_t tid1, tid2, tid3;
    pthread_create(&tid1, NULL, thread_func, NULL);
    pthread_create(&tid2, NULL, thread_func, NULL);
    pthread_create(&tid3, NULL, thread_func, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    pthread_join(tid3, NULL);

    pthread_mutex_destroy(&g_mutex);

    printf("Check RST=%d, RST=%d.\n", 3 * LEN, num);
    return 0;
}

Method 1: pstack

[root@localhost ~]# ps -ef|grep ./thr
root     27736 26354  0 15:15 pts/1    00:00:00 ./thr
root     27753 27613  0 15:16 pts/2    00:00:00 grep --color=auto ./thr
[root@localhost ~]# pstack 27736
Thread 3 (Thread 0x7f26c4ebc700 (LWP 27737)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f26c46bb700 (LWP 27738)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f26c568a740 (LWP 27736)):
#0  0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000004008c9 in main () at thr.cpp:30
[root@localhost ~]# pstack 27736
Thread 3 (Thread 0x7f26c4ebc700 (LWP 27737)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f26c46bb700 (LWP 27738)):
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f26c568a740 (LWP 27736)):
#0  0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000004008c9 in main () at thr.cpp:30

You can see that the two captures are essentially identical: both worker threads are stuck at thr.cpp:13, inside pthread_mutex_lock.

Method 2: gdb attach

[root@localhost ~]# gdb attch 27736
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attch: No such file or directory.
Attaching to process 27736
Reading symbols from /root/jiangsu-wuxi/poco_demo/thr...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 27738]
[New LWP 27737]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64
(gdb) info thread
  Id   Target Id         Frame 
  3    Thread 0x7f26c4ebc700 (LWP 27737) "thr" 0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
  2    Thread 0x7f26c46bb700 (LWP 27738) "thr" 0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
* 1    Thread 0x7f26c568a740 (LWP 27736) "thr" 0x00007f26c5286ef7 in pthread_join () from /lib64/libpthread.so.0
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f26c4ebc700 (LWP 27737))]
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f26c528bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f26c5287d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f26c5287c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
#4  0x00007f26c5285dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f26c4fb321d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x000000000040080f in thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
(gdb) p g_mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 27739, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000[l\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) 

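You can see that __owner = 27739. The mutex is held by LWP 27739, but info thread lists no such thread, which means the owner exited without unlocking the mutex.

Note that attch in the command above is a typo. gdb failed to open it as a program file ("attch: No such file or directory") and then attached to PID 27736 anyway. The usual form is gdb -p 27736.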

Method 3: gdb breakpoints with commands

[root@localhost poco_demo]# gdb ./thr
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/jiangsu-wuxi/poco_demo/thr...done.
(gdb) b thr.cpp :13
Breakpoint 1 at 0x400805: file thr.cpp, line 13.
(gdb) b thr.cpp :16
Breakpoint 2 at 0x400832: file thr.cpp, line 16.
(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000400805 in thread_func(void*) at thr.cpp:13
2       breakpoint     keep y   0x0000000000400832 in thread_func(void*) at thr.cpp:16
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>p "lock"
>thread
>c
>end
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>p "unlock"
>thread
>c
>end
(gdb) set pagination off
......
Breakpoint 1, thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
$19997 = "lock"
[Current thread is 2 (Thread 0x7ffff77fd700 (LWP 29692))]
[Switching to Thread 0x7ffff67fb700 (LWP 29694)]

Breakpoint 2, thread_func (arg=0x0) at thr.cpp:16
16	        pthread_mutex_unlock(&g_mutex);
$19998 = "unlock"
[Current thread is 4 (Thread 0x7ffff67fb700 (LWP 29694))]

Breakpoint 1, thread_func (arg=0x0) at thr.cpp:13
13	        pthread_mutex_lock(&g_mutex);
$19999 = "lock"
[Current thread is 4 (Thread 0x7ffff67fb700 (LWP 29694))]
[Thread 0x7ffff77fd700 (LWP 29692) exited]

In the end, the program is stuck at thr.cpp:13. The log shows why: thread 2 (LWP 29692) hits the lock breakpoint and then exits without ever hitting the unlock breakpoint. Analyzing the code from there leads to the line if (num == 9999) return NULL;. When this line returns, the lock is not released, so none of the remaining threads can acquire it again.
P.S. Without set pagination off, gdb pauses after each screenful of output and continues only after you confirm.

Tags: Linux

Posted by gpong on Sat, 14 May 2022 05:03:21 +0300