Debugging a Python segmentation fault in garbage collection

Published 2024-06-26 14:42:51


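I'm running into segmentation faults (SIGSEGV) that occur during garbage collection in CPython. I've also had one instance where a process was killed with SIGBUS. My own code is mostly Python plus a little very high-level Cython. I'm most certainly not (deliberately and explicitly) fooling around with pointers or writing to memory directly.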


Example backtraces from coredumps (extracted with gdb):

#0  0x00007f8b0ac29471 in subtype_dealloc (self=<Task at remote 0x7f8afc0466d8>)
    at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1182
#1  0x00007f8b0abe8947 in method_dealloc (im=0x7f8afc883e08) at /usr/src/debug/Python-3.5.1/Objects/classobject.c:198
#2  0x00007f8b0ac285a9 in clear_slots (type=type@entry=0x560219f0fa88, 
    self=self@entry=<Handle at remote 0x7f8afc035948>) at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1044
#3  0x00007f8b0ac29506 in subtype_dealloc (self=<Handle at remote 0x7f8afc035948>)
    at /usr/src/debug/Python-3.5.1/Objects/typeobject.c:1200
#4  0x00007f8b0ac8caad in PyEval_EvalFrameEx (
    f=f@entry=Frame 0x56021a01ff08, for file /usr/lib64/python3.5/asyncio/base_events.py, line 1239, in _run_once (self=<_UnixSelectorEventLoop(_coroutine_wrapper_set=False, _current_handle=None, _ready=<collections.deque at remote 0x7f8afd39a250>, _closed=False, _task_factory=None, _selector=<EpollSelector(_map=<_SelectorMapping(_selector=<...>) at remote 0x7f8afc868748>, _epoll=<select.epoll at remote 0x7f8b0b1b8e58>, _fd_to_key={4: <SelectorKey at remote 0x7f8afcac8a98>, 6: <SelectorKey at remote 0x7f8afcac8e08>, 7: <SelectorKey at remote 0x7f8afcac8e60>, 8: <SelectorKey at remote 0x7f8afc873048>, 9: <SelectorKey at remote 0x7f8afc873830>, 10: <SelectorKey at remote 0x7f8afc873af0>, 11: <SelectorKey at remote 0x7f8afc87b620>, 12: <SelectorKey at remote 0x7f8afc87b7d8>, 13: <SelectorKey at remote 0x7f8afc889af0>, 14: <SelectorKey at remote 0x7f8afc884678>, 15: <SelectorKey at remote 0x7f8afc025eb8>, 16: <SelectorKey at remote 0x7f8afc889db0>, 17: <SelectorKey at remote 0x7f8afc01a258>, 18: <SelectorKey at remote 0x7f8afc...(truncated), 
    throwflag=throwflag@entry=0) at /usr/src/debug/Python-3.5.1/Python/ceval.c:1414

During the sweep (I think):

[backtrace missing in this copy]

There was also one during malloc:

#0  _PyObject_Malloc (ctx=0x0, nbytes=56) at /usr/src/debug/Python-3.4.3/Objects/obmalloc.c:1159
1159                if ((pool->freeblock = *(block **)bp) != NULL) {
(gdb) bt
#0  _PyObject_Malloc (ctx=0x0, nbytes=56) at /usr/src/debug/Python-3.4.3/Objects/obmalloc.c:1159

And the SIGBUS trace (which looks like it happened while CPython was recovering from another error):

#0  malloc_printerr (ar_ptr=0x100101f0100101a, ptr=0x7f067955da60 <generations+32>, str=0x7f06785a2b8c "free(): invalid size", action=3) at malloc.c:5009
5009        set_arena_corrupt (ar_ptr);
(gdb) bt
#0  malloc_printerr (ar_ptr=0x100101f0100101a, ptr=0x7f067955da60 <generations+32>, str=0x7f06785a2b8c "free(): invalid size", action=3) at malloc.c:5009
#1  _int_free (av=0x100101f0100101a, p=<optimized out>, have_lock=0) at malloc.c:3842
Python Exception <type 'exceptions.RuntimeError'> Type does not have a target.:

These backtraces come from Fedora 24 with Python 3.5.1 and from CentOS 7 with Python 3.4.3. So I'm ruling out the following:

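  • Bad memory (possible, but it would be quite a coincidence that my laptop and a (virtual) server show the same issue while both otherwise work fine).
  • Issues in the OS, in CPython, or in the combination of the two.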

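So, as usual, it must be my own code. It is a mix of threaded code for (computational) "tasks" and a thread running an asyncio loop. The code base has other "workloads" that run fine. The most prominent differences in the code that "serves" this "workload" are that I use (threading.RLock) locks quite heavily to serialize some requests, and that I'm serializing data and writing it to disk.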
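To make the described setup concrete, here is a minimal sketch of that kind of architecture: an asyncio loop in its own thread, computational tasks on worker threads, and a `threading.RLock` serializing writes to disk. The original code is not shown, so every name here (`compute`, `store_result`, `handle`, `main`, the file name) is hypothetical, and the sketch is written for a current Python 3 rather than the 3.4/3.5 versions in the question.

```python
import asyncio
import pickle
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical lock guarding the output file; RLock lets the owning
# thread re-acquire it without deadlocking.
write_lock = threading.RLock()


def compute(n):
    """Stand-in for a computational task that runs on a worker thread."""
    return sum(i * i for i in range(n))


def store_result(path, key, value):
    """Pickle a result and append it to a file, one writer at a time."""
    payload = pickle.dumps((key, value))
    with write_lock:
        with path.open("ab") as fh:
            fh.write(payload)


async def handle(pool, path, n):
    """Coroutine on the loop thread: offload compute and disk I/O to workers."""
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(pool, compute, n)
    await loop.run_in_executor(pool, store_result, path, "task-{}".format(n), result)
    return result


def run_loop(loop):
    """Target of the dedicated event-loop thread."""
    asyncio.set_event_loop(loop)
    loop.run_forever()


def main():
    loop = asyncio.new_event_loop()
    loop_thread = threading.Thread(target=run_loop, args=(loop,), daemon=True)
    loop_thread.start()
    try:
        with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(max_workers=4) as pool:
            path = Path(tmp) / "results.bin"
            # Submit requests to the loop from the main thread.
            pending = [
                asyncio.run_coroutine_threadsafe(handle(pool, path, n), loop)
                for n in (10, 100)
            ]
            return [future.result() for future in pending]
    finally:
        # Stop the loop from outside its thread, then clean up.
        loop.call_soon_threadsafe(loop.stop)
        loop_thread.join()
        loop.close()
```

Calling `main()` runs two requests end to end. Nothing in this pure-Python sketch should be able to corrupt memory by itself, which is what makes crashes in such a setup puzzling.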

Any suggestions on how to find the root cause would be greatly appreciated!

Things I've tried:

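  • Stripping external dependencies down to the bare minimum (cloudpickle): no difference.
  • Dumping counts of all object types the garbage collector sees before and after explicit collections, and checking whether any object types tracked before a crash are absent from "successful" collections: there aren't any.
  • Going through the core dumps with GDB: they are not entirely consistent in terms of the C and Python backtraces.
  • Running with MALLOC_CHECK_=2: I don't see any error messages, only that the process exited with exit code -11.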
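The type-count comparison from the list above can be sketched roughly as follows. This is only an illustration of the idea using the standard `gc` module, not the code actually used; the helper names `tracked_type_counts` and `counts_dropped_by_collect` are made up for this example.

```python
import gc
from collections import Counter


def tracked_type_counts():
    """Count the objects the collector currently tracks, keyed by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def counts_dropped_by_collect():
    """Run an explicit collection and report which type counts went down.

    Returns a dict mapping type name to the number of tracked instances
    that disappeared across the gc.collect() call.
    """
    before = tracked_type_counts()
    gc.collect()
    after = tracked_type_counts()
    return {
        name: before[name] - after[name]
        for name in before
        if before[name] > after[name]
    }
```

Logging the result of `counts_dropped_by_collect()` before each explicit collection gives a record of what the collector was about to free; comparing the last record before a crash with records from successful runs is the check described in the list.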

Edit 1

Found something: https://gist.github.com/frensjan/dc9bc784229fec844403c9d9528ada66

Most notable:

==19072== Invalid write of size 8
==19072==    at 0x47A3B7: subtype_dealloc (typeobject.c:1157)
==19072==    by 0x4E0696: PyEval_EvalFrameEx (ceval.c:1388)
==19072==    by 0x58917B: gen_send_ex (genobject.c:104)
==19072==    by 0x58917B: _PyGen_Send (genobject.c:158)
...
==19072==    by 0x4E2A6A: call_function (ceval.c:4262)
==19072==    by 0x4E2A6A: PyEval_EvalFrameEx (ceval.c:2838)
==19072==  Address 0x8 is not stack'd, malloc'd or (recently) free'd

and

==19072== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==19072==  Access not within mapped region at address 0x8
==19072==    at 0x47A3B7: subtype_dealloc (typeobject.c:1157)
==19072==    by 0x4E0696: PyEval_EvalFrameEx (ceval.c:1388)
==19072==    by 0x58917B: gen_send_ex (genobject.c:104)
==19072==    by 0x58917B: _PyGen_Send (genobject.c:158)

But the errors valgrind found point to the same locations as the coredumps I got earlier, nowhere near my own code... I'm not sure what to do with this.

Environment: Python 3.4.5+ built with ./configure --without-pymalloc on CentOS 7. Python was run with valgrind --tool=memcheck --dsymutil=yes --track-origins=yes --show-leak-kinds=all --trace-children=yes python ...
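As a side note on reading these results: the exit code -11 seen earlier and the "signal 11 (SIGSEGV)" in the valgrind output describe the same event. On POSIX, Python's `subprocess` reports a child killed by signal N as return code -N. The helper below is a small sketch of decoding that; `describe_returncode` is a hypothetical name, not part of any library.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # Negative values mean the child was terminated by a signal (POSIX).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal {}".format(-returncode)
        return "killed by " + name
    return "exited with status {}".format(returncode)
```

For example, passing the `returncode` attribute of a `subprocess.run(...)` result to `describe_returncode` makes a crash in a child interpreter stand out in logs.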

Any help is much appreciated!

