Java: _XReply() terminates the application with _XIOError()


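We are developing a fairly complex application that combines a Linux binary with Java JNI calls (into a JVM created by the Linux binary) to our custom application .jar file. All GUI work is implemented and done by the Java part. Every time a GUI property has to change or the GUI has to be repainted, this is done through a JNI call into the JVM.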

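The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. This happens iteratively and frequently, a few hundred or a few thousand iterations per second.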

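After a fairly exact amount of time the application is terminated with exit(1), which I caught in gdb being called from _XIOError(). The termination can be reproduced after a more or less exact period, e.g. after about 15 hours on a 2.5 GHz dual-core x86 machine. On a slower computer it takes longer, as if the time were proportional to CPU/GPU speed. One conclusion might be that some part of Xorg runs out of some resource, or something similar.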

Here is my backtrace:

#0  0xb7fe1424 in __kernel_vsyscall ()
#1  0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2  0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3  0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4  0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5  0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6  0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7  0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8  0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9  0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so

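I wrote my own exit() in liboverrides.so and loaded it with LD_PRELOAD so that, with the help of abort()/SIGABRT, I could catch the exit() call in gdb. After some debugging of libX11 and libxcb, I noticed that _XReply() got a NULL reply (the response from xcb_wait_for_reply()), which leads to the call of _XIOError() and exit(1). Digging deeper into libxcb's xcb_wait_for_reply() function, I noticed that one of the reasons it can return a NULL reply is that it detects a broken or closed socket connection, which could be my situation.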

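For testing purposes, if I change xcb_io.c to ignore _XIOError(), the application no longer works. If I repeat the request inside _XReply(), it fails every time, i.e. every xcb_wait_for_reply() returns a NULL response.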

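So my question is: why does the application terminate via _XReply() -> _XIOError() -> exit(1), and how can I find out the reason and what is happening, so that I can fix it or work around it?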

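As I wrote above, I have to wait about 15 hours for the problem to occur again, and right now I have very little time for debugging, so I have not been able to find the cause of the termination. We also tried reorganizing the Java part that handles GUI/display refresh, but that did not solve the problem.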

Some facts:
- Java JRE 1.8.0_20; the problem also reproduces with Java 7
- libX11.so 1.5.0
- libxcb.so 1.8.1
- Debian wheezy
- kernel 3.2.0


1 answer

# Answer 1
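This may be a known issue in libX11 with how it handles the request numbers used for xcb_wait_for_reply.

At some point after libxcb v1.5, code was introduced that uses 64-bit sequence numbers internally everywhere, with logic added to widen sequence numbers on entry to the public APIs that still take 32-bit sequence numbers.

Here is a quote from the submitted libxcb bug report (actual email addresses removed):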
We have an application that does a lot of XDrawString and XDrawLine. After several hours the application is exited by an XIOError.

The XIOError is called in libX11 in the file xcb_io.c, function _XReply. It didn't get a response from xcb_wait_for_reply.

libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to this commit:

commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date: Sat Oct 9 17:13:45 2010 -0700
```
xcb_in: Use 64-bit sequence numbers internally everywhere.

Widen sequence numbers on entry to those public APIs that still take
32-bit sequence numbers.

Signed-off-by: Jamey Sharp <jamey@xxxxxx.xxx>
```
Reverting it on top of 1.8.1 helps.

Adding traces to libxcb I found that the last request numbers used for xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in the while loop of the _XReply function), half a second later: 63215 (then XIOError is called). The widen_request is also 63215, I would have expected 63215+2^32. Therefore it seems that the request is not correctly widened.

The commit above also changed the compares in poll_for_reply from XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening never worked correctly, but it was never observed, because only the lower 32bits were compared.
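The widening that the report refers to can be sketched as follows. This is a minimal illustration, not the libxcb source: `widen_sequence` and `latest_sent` are hypothetical names, and the sketch assumes that the upper 32 bits are taken from the newest 64-bit request number the connection has sent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not the libxcb implementation): widen a 32-bit
 * sequence number to 64 bits, taking the upper 32 bits from latest_sent,
 * the newest 64-bit request number sent on the connection. */
uint64_t widen_sequence(uint64_t latest_sent, uint32_t seq32)
{
    uint64_t widened = (latest_sent & UINT64_C(0xffffffff00000000)) | seq32;

    /* A reply cannot belong to a request that has not been sent yet, so a
     * value ahead of latest_sent must come from the previous 2^32 epoch. */
    if (widened > latest_sent) {
        /* Precondition: seq32 refers to a request that was already sent. */
        assert(widened >= (UINT64_C(1) << 32));
        widened -= UINT64_C(1) << 32;
    }
    return widened;
}
```

With `latest_sent` equal to 2^32 + 63215, `widen_sequence(latest_sent, 63215)` gives 63215 + 2^32, the value the reporter expected. If the 64-bit counter has itself fallen back to 63215, the result is 63215, which matches the trace in the report, although this sketch alone does not establish that this is what happened inside libxcb.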
Reproducing the problem

Here is the original code snippet from the submitted bug report that was used to reproduce the problem:
```
  for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
  }
```
Apparently the problem can be reproduced with even simpler code:
```
 for(;;) {
    XNoOp(dpy);
  }
```
According to the submitted libxcb bug report, these conditions are needed to reproduce it (assuming the reproduction code is in xdraw.c):
- libxcb >= 1.8 (i.e. includes the commit ed37b08)
- compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
- the sequence counter wraps.
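To relate the last condition to the roughly 15 hours reported in the question, the wrap point of a 32-bit counter can be converted into an average request rate. This is a back-of-the-envelope sketch that assumes the crash in the question is caused by this wrap, which the bug report suggests but does not prove; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of requests after which a 32-bit sequence counter wraps. */
#define WRAP_REQUESTS (UINT64_C(1) << 32)

/* Average requests per second (rounded down) needed to wrap in `hours`. */
uint64_t requests_per_second_to_wrap(uint64_t hours)
{
    assert(hours > 0);
    return WRAP_REQUESTS / (hours * 3600);
}

/* Whole hours until the counter wraps at a constant request rate. */
uint64_t hours_until_wrap(uint64_t requests_per_second)
{
    assert(requests_per_second > 0);
    return WRAP_REQUESTS / requests_per_second / 3600;
}
```

Under that assumption, wrapping after 15 hours corresponds to about 79,500 X requests per second on average (`requests_per_second_to_wrap(15)` returns 79536). A tight loop issuing around a million requests per second would wrap after a little more than an hour.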
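Proposed patch

The proposed patch, said to apply on top of libxcb 1.8.1, is as follows (the file it touches, xcb_io.c, is the libX11 file named in the quoted report):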
```
diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
        static const xReq dummy_request;
        static char const pad[3];
        struct iovec vec[3];
-       uint64_t requests;
+       unsigned long requests;
        _XExtension *ext;
        xcb_connection_t *c = dpy->xcb->connection;
        if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
        if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
        {
                uint64_t sequence;
-               for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+               for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
                        append_pending_request(dpy, sequence);
        }
        requests = dpy->request - dpy->xcb->last_flushed;
```
Detailed technical explanation

Please find below the detailed technical explanation by Jonas Petersen (also included in the bug report mentioned above):
Hi,

Here's two patches. The first one fixes a 32-bit sequence wrap bug. The second patch only adds a comment to another relevant statement.

The patches contain some details. Here is the whole story for who might be interested:

Xlib (libx11) will crash an application with a "Fatal IO error 11 (Resource temporarily unavailable)" after 4 294 967 296 requests to the server. That is when the Xlib internal 32-bit sequence wraps.

Most applications probably will hardly reach this number, but if they do, they have a chance to die a mysterious death. For example the application I'm working on did always crash after about 20 hours when I started to do some stress testing. It does some intensive drawing through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per second in full hd resolution (on Ubuntu). Some optimizations did extend the grace to about 35 hours but it would still crash.

What then followed was some frustrating weeks of digging and debugging to realize that it's not in my application, nor in gtkmm, gtk or glib but that it's this little bug in Xlib which exists since 2006-10-06 apparently.

It took a while to turn out that the number 0x100000000 (2^32) has some relevance. (Much) later it turned out it can be reproduced with Xlib only, using this code for example:

while(1) { XDrawPoint(display, drawable, gc, x, y); XFlush(display); }

It might take one or two hours, but when it reaches the 4294 million it will explode into a "Fatal IO error 11".

What I then learned is that even though Xlib uses internal 32bit sequence numbers they get (smartly) widened to 64bit in the process so that the 32bit sequence may wrap without any disruption in the widened 64bit sequence. Obviously there must be something wrong with that.

The Fatal IO error is issued in _XReply() when it's not getting a reply where there should be one, but the cause is earlier in _XSend() in the moment when the Xlib 32-bit sequence number wraps.

The problem is that when it wraps to 0, the value of 'last_flushed' will still be at the upper boundary (e.g. 0xffffffff). There is two locations in _XSend() (xcb_io.c) that fail in this state because they rely on those values being sequential all the time, the first location is:

requests = dpy->request - dpy->xcb->last_flushed;

In case of request = 0x0 and last_flushed = 0xffffffff it will assign 0xffffffff00000001 to 'requests' and then to XCB as a number (amount) of requests. This is the main killer.

The second location is this:

for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)

In case of request = 0x0 (less than last_flushed) there is no chance to enter the loop ever and as a result some requests are ignored.

The solution is to "unwrap" dpy->request at these two locations and thus retain the sequence related to last_flushed.

uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;

It creates a temporary 64-bit request number which has bit 32 set if 'request' is less than 'last_flushed'. It is then used in the two locations instead of dpy->request.

I'm not sure if it might be more efficient to use that statement inplace, instead of using a variable.

There is another line in require_socket() that worried me at first:

dpy->xcb->last_flushed = dpy->request = sent;

That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to 32-bit when assigning it to 'request' and then also assign the truncated value to the (64-bit) 'last_flushed'. But it seems intended. I have added a note explaining that for the next poor soul debugging sequence issues... :-)
- Jonas
Jonas Petersen (2):
  xcb_io: Fix Xlib 32-bit request number wrapping
  xcb_io: Add comment explaining a mixed type double assignment

 src/xcb_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

1.7.10.4
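The arithmetic in the explanation above can be checked in isolation. The sketch below assumes a 32-bit build, where dpy->request is a 32-bit unsigned long, and models it with uint32_t. The function names are hypothetical stand-ins, and the unwrap expression is the one quoted in the email, not necessarily the code that was finally committed.

```c
#include <assert.h>
#include <stdint.h>

/* `request` stands in for dpy->request (32-bit on a 32-bit build) and
 * `last_flushed` for dpy->xcb->last_flushed (64-bit). */

/* Original expression: request is promoted to 64 bits before the
 * subtraction, so a wrapped request yields a huge count. */
uint64_t pending_requests_buggy(uint32_t request, uint64_t last_flushed)
{
    return request - last_flushed;
}

/* Unwrap expression from the email: add 2^32 when request has wrapped
 * and is therefore numerically below last_flushed. */
uint64_t unwrap_request(uint32_t request, uint64_t last_flushed)
{
    return ((uint64_t)(request < last_flushed) << 32) + request;
}

/* Count of pending requests computed from the unwrapped value. */
uint64_t pending_requests_fixed(uint32_t request, uint64_t last_flushed)
{
    uint64_t unwrapped = unwrap_request(request, last_flushed);

    /* Sanity check: the unwrapped request must not be behind last_flushed. */
    assert(unwrapped >= last_flushed);
    return unwrapped - last_flushed;
}

/* Whether the loop `for (seq = last_flushed + 1; seq <= limit; ++seq)`
 * would run at least once for the given upper limit. */
int loop_would_run(uint64_t last_flushed, uint64_t limit)
{
    return last_flushed + 1 <= limit;
}
```

For request = 0 and last_flushed = 0xffffffff, `pending_requests_buggy` returns 0xffffffff00000001 while `pending_requests_fixed` returns 1. Likewise, `loop_would_run(0xffffffff, 0)` is false, but it becomes true once the limit is the unwrapped value 2^32.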
Good luck.
