Bug 568859 - [GTK] Crashes and deadlocks from certain SWT operations on background thread
Summary: [GTK] Crashes and deadlocks from certain SWT operations on background thread
Status: NEW
Alias: None
Product: Platform
Classification: Eclipse Project
Component: SWT (show other bugs)
Version: 4.15   Edit
Hardware: PC Linux
: P3 normal with 1 vote (vote)
Target Milestone: ---   Edit
Assignee: Platform-SWT-Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 575402
Blocks:
  Show dependency tree
 
Reported: 2020-11-16 10:54 EST by Will Rogers CLA
Modified: 2021-08-29 11:14 EDT (History)
8 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Will Rogers CLA 2020-11-16 10:54:55 EST
An editor in an Eclipse RCP application renders a subclass of Canvas 'prepares' an image on a background thread (using a GC), and then queues a task on the UI thread to call redraw().

This has worked fine for years but when upgrading from 2018-12 to 2020-12 this behaviour is now apparently problematic on Linux / GTK3. Usually the app will immediately crash and occasionally the entire Gnome desktop session will lock up. A message such as 

[xcb] Unknown request in queue while dequeuing
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
java: xcb_io.c:165: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

will often be printed to the console. I cannot find any other relevant logfile.

I see this problem running gnome-shell 3.28.3 on RHEL 7.8. There is no problem when running on Windows.

I cannot reproduce this problem in a small SWT app. The full code is here but is complex: https://github.com/ControlSystemStudio/cs-studio/blob/master/applications/appunorganized/appunorganized-plugins/org.csstudio.swt.rtplot/src/org/csstudio/swt/rtplot/internal/Plot.java

Another example error:

(cs-studio:14332): Gdk-ERROR **: 15:43:07.742: The program 'cs-studio' received an X Window System error.
This probably reflects a bug in the program.
The error was 'RenderBadPicture (invalid Picture parameter)'.
  (Details: serial 16955 error_code 143 request_code 139 (RENDER) minor_code 26)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the GDK_SYNCHRONIZE environment
   variable to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)


This comment explains exactly the same bug:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=546743#c17
Comment 1 Alexandr Miloslavskiy CLA 2020-11-16 11:16:49 EST
It makes things a lot easier if your application is not confidential.

I would suggest to approach reproducer snippet this way:
1) Create a new project
2) Take entire 'Plot.java' you have linked there, and any dependencies that you can't just NULL or delete from code.
3) Ensure that issue is reproducible, commit.
4) Delete more code, ensure issue is still there, commit.

Spend a bit of time on that, and then there will be good chances that someone can look into remaining code. Unfortunately I'm too busy currently to try to debug lots of code. Not sure about other team members.
Comment 2 Nikita Nemkin CLA 2020-11-16 12:17:35 EST
(In reply to Will Rogers from comment #0)
> Usually the app will immediately crash and occasionally the entire
> Gnome desktop session will lock up.
Please elaborate. Which app should I run to see an immediate crash? Is it the master branch of cs-studio? (The README says that Eclipse-based cs-studio is deprecated.)

Looking at the code, the drawing in Plot.updateImageBuffer() is fine, except this bit at the end: https://github.com/ControlSystemStudio/cs-studio/blob/0ca3c2ccefaeee8b535f60ec74424041c4cf4cf5/applications/appunorganized/appunorganized-plugins/org.csstudio.swt.rtplot/src/org/csstudio/swt/rtplot/internal/Plot.java#L729-L738

Creating an off-screen image and painting into it should work without extra locking. OTOH, disposing an image on a working thread while the main thread is drawing it (in Plot.paintControl) might be a cause of synchronization bugs.

Instead of "synchronized (image)", synchronize with the message loop using asyncExec, something like:

    display.asyncExec(() -> {
        if (plot_image != null) plot_image.dispose();
        plot_image = image;
        // you can also call redraw() here instead of calling redrawSafely()
    });
Comment 3 Will Rogers CLA 2020-11-18 05:35:46 EST
Thank you for the responses.

The app is cs-studio and it is deprecated, but we expect still to have to maintain it for several years. It is not trivial to build but the following instructions should work:

* checkout swt-crash branch
* build according to the README
* launch the gtk product product/repository/target/products/cs-studio/linux/gtk/x86_64/cs-studio/cs-studio
* click on the toolbar button "Open new Databrowser plot"
* crash

I am able to work around the problem with the following commit, but this may slow down rendering and I would like to understand what is wrong with the code.

https://github.com/ControlSystemStudio/cs-studio/commit/46033af9a844eccceb9eb53b5e2da842947fcf0a

I will attempt to reduce the code to a smaller example.
Comment 4 Will Rogers CLA 2020-11-18 07:08:14 EST
I find that there is already a standalone demo for plotting. I have isolated two plugins and you can now reproduce this bug by

* cloning https://github.com/willrogers/swt-crash
* importing both plugins into Eclipse
* running TimePlotDemo from org.csstudio.swt.rtplot.test
Comment 5 Nikita Nemkin CLA 2020-11-18 08:42:41 EST
(In reply to Will Rogers from comment #4)
> I find that there is already a standalone demo for plotting. I have isolated
> two plugins and you can now reproduce this bug by
> 
> * cloning https://github.com/willrogers/swt-crash
> * importing both plugins into Eclipse
> * running TimePlotDemo from org.csstudio.swt.rtplot.test

This is very helpful. I can reproduce Xlib-related crash even on WSL+VcXsrv. So far it looks like an issue in SWT Image implementation.
Comment 6 Nikita Nemkin CLA 2020-11-18 09:12:21 EST
So the obvious problem is that Image(width, height) constructor uses gdk_window_create_similar_surface() to create its backing Cairo surface. This function produces XCB surfaces, which aren't thread-agnostic. I also doubt this function is thread-safe, since it takes a GtkWindow parameter.

An easy solution would be to always use pure software Cairo surfaces, which is already the case for GTK4 and Wayland. An opinion of someone experienced with X and GTK is welcome.
Comment 7 Alexandr Miloslavskiy CLA 2020-11-18 09:18:05 EST
I'm a bit confused, does this explain why it worked fine before (I understand, before removal of platform lock in SWT)? In my understanding, platform lock will merely prevent other threads from using the surface while it's being accessed by other thread. However, according to Will's description, image is prepared and then queued to be painted, that is (in my understanding) it shouldn't be accessed by 2 threads at once.

Also, with this knowledge, can you make a snippet?
Comment 8 Nikita Nemkin CLA 2020-11-18 10:15:07 EST
(In reply to Alexandr Miloslavskiy from comment #7)
> I'm a bit confused, does this explain why it worked fine before (I
> understand, before removal of platform lock in SWT)? In my understanding,
> platform lock will merely prevent other threads from using the surface while
> it's being accessed by other thread. However, according to Will's
> description, image is prepared and then queued to be painted, that is (in my
> understanding) it shouldn't be accessed by 2 threads at once.
> 
> Also, with this knowledge, can you make a snippet?
Unless automatic locking is requested with XInitThreads, XCB calls must be serialized. The resource here isn't any particular surface, but the display server communication chanel.
Since XCB calls are internal to GTK (and Cairo when using XCB surface backend), Platform lock guaranteed serialization by locking every GTK and Cairo function.
Without the lock, any display-related operations on the secondary thread, including creation and destruction of XCB surfaces, will race with the main thread.
gdk_window_create_similar_surface needs display, cairo_image_surface_create doesn't, so we have to use the latter to make SWT images thread-agnostic.

As for the snippet, a simple

    while (true) new Image(display, 100, 100).dispose();

on a worker thread is enough to trigger an error.
Comment 9 Eric Williams CLA 2020-11-18 14:05:11 EST
It's been awhile since I touched this code, but IIRC we were using cairo_image_surface_create() before and there were issues with HiDPI not working properly, as well as screenshots not rendering properly when running SWTBot tests.

For example there was this fix from bug 535123: https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/124241

The state of drawing in GTK3 is a bit of a mess, IMO it's easier to re-introduce locking for such functions than it is to convert SWT Images to use direct Cairo calls. There are too many corner cases and opportunities for breakages with the latter approach.

In GTK4 there is only one surface so it's a bit easier to manage than it is in GTK3.
Comment 10 Nikita Nemkin CLA 2020-11-18 14:57:14 EST
(In reply to Eric Williams from comment #9)
> It's been awhile since I touched this code, but IIRC we were using
> cairo_image_surface_create() before and there were issues with HiDPI not
> working properly, as well as screenshots not rendering properly when running
> SWTBot tests.

That's probably because cairo_image_surface_create() has to be paired with cairo_surface_set_device_scale().

Anyway, there's also gdk_window_create_similar_image_surface() which might be a better replacement for gdk_window_create_similar_surface(). Judging by the name, it creates display-independent surfaces.

> The state of drawing in GTK3 is a bit of a mess, IMO it's easier to
> re-introduce locking for such functions than it is to convert SWT Images to
> use direct Cairo calls. There are too many corner cases and opportunities
> for breakages with the latter approach.

As far as I can see, Image and GC do use Cairo almost exclusively and should work just fine with image surfaces. 

Locking is an all-or-nothing affair (GTK, GDK and Cairo). It would be rather difficult to separate functions that require display from the rest.
Comment 11 Alexandr Miloslavskiy CLA 2020-11-18 15:08:12 EST
Maybe make methods of Image 'synchronized' ?
Comment 12 Eclipse Genie CLA 2020-11-19 06:23:39 EST
New Gerrit change created: https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/172497
Comment 13 Nikita Nemkin CLA 2020-11-19 06:27:06 EST
(In reply to Alexandr Miloslavskiy from comment #11)
> Maybe make methods of Image 'synchronized' ?

This won't help, because other GDK and GTK functions are free to make XCB calls.


I've posted a patch that uses gdk_window_create_similar_image_surface(). It solves the problem, but needs more testing, especially on high DPI screens.
Comment 14 Serhiy _ CLA 2021-01-07 19:18:55 EST
After spending two days trying every solution I could find I finally found this. Confirming that this is a serious issue that renders Eclipse unusable. Can be workarounded by switching to Wayland but fonts are horrible on Eclipse there without GNOME installed (I'd really rather not). This contains the fonts example: https://stackoverflow.com/a/65621728/854477
Comment 15 Serhiy _ CLA 2021-01-07 19:20:42 EST
Is this a duplicate of https://bugs.eclipse.org/bugs/show_bug.cgi?id=495896 ?
Comment 16 Serhiy _ CLA 2021-01-09 18:05:21 EST
Downgrading from the Eclipse interface didn't work https://stackoverflow.com/q/65645059/854477

So I just downloaded Eclipse 2020-06 release anew and it seems to be working.
Comment 18 Eclipse Genie CLA 2021-08-13 11:32:52 EDT
New Gerrit change created: https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/183997
Comment 19 Andrey Loskutov CLA 2021-08-13 12:11:23 EDT
(In reply to Eclipse Genie from comment #17)
> Gerrit change
> https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/172497 was
> merged to [master].
> Commit:
> http://git.eclipse.org/c/platform/eclipse.platform.swt.git/commit/
> ?id=94219602a3ecaf70f0d9c72c94302c7e3ce79ad7

This change causes severe regression in the IDE, see bug 575402, line number ruler doesn't properly repaint itself anymore. The IDE code in question is in org.eclipse.jface.text.source.LineNumberRulerColumn.doubleBufferPaint(GC).

With the "bad" patch below, that switches off buffering of the line numbers, the IDE works again:

diff --git a/org.eclipse.jface.text/src/org/eclipse/jface/text/source/LineNumberRulerColumn.java b/org.eclipse.jface.text/src/org/eclipse/jface/text/source/LineNumberRulerColumn.java
index 346432f..10f3a67 100644
--- a/org.eclipse.jface.text/src/org/eclipse/jface/text/source/LineNumberRulerColumn.java
+++ b/org.eclipse.jface.text/src/org/eclipse/jface/text/source/LineNumberRulerColumn.java
@@ -732,2 +732,3 @@
 			}
+			dy= 0; // XXX
 			// dy == 0 means now "draw everything", either because there was no overlap, or we indeed didn't move

The code in question creates some buffer image to draw and new GC:

GC bufferGC= new GC(fBuffer);
Image newBuffer= new Image(fCanvas.getDisplay(), size.x, size.y);
GC localGC= new GC(newBuffer);
try {
	initializeGC(localGC, 0, bufferY, size.x, bufferH);
	doPaint(localGC, visibleLines);
} finally {
	localGC.dispose();
}
bufferGC.drawImage(newBuffer, 0, bufferY, size.x, bufferH, 0, bufferY, size.x, bufferH);

I *guess* it is the call to GC.drawImage that tries to draw the new buffer is broken now:
org.eclipse.swt.graphics.GC.drawImage(Image, int, int, int, int, int, int, int, int, boolean)

(In reply to Eclipse Genie from comment #18)
> New Gerrit change created:
> https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/183997

This is the revert of the original patch, to fix the regression for M3 next week (Monday last day for "big" changes).
Comment 20 Nikita Nemkin CLA 2021-08-13 12:15:05 EDT
Revert is the best for now. I'll investigate the breakage.
Comment 22 Serhiy _ CLA 2021-08-13 16:27:56 EDT
_This_ issue is a severe regression that prevents Eclipse from being usable at all at least on Linux. Compared to that a ruler being non-functional in just a mild inconvenience :)
Comment 23 Alexander Kurtakov CLA 2021-08-14 01:09:45 EDT
(In reply to Serhiy _ from comment #22)
> _This_ issue is a severe regression that prevents Eclipse from being usable
> at all at least on Linux. Compared to that a ruler being non-functional in
> just a mild inconvenience :)

Can you please try out whether issue is reproducible with https://flathub.org/apps/details/org.eclipse.Java ? Is this under X11 or Wayland?
Comment 24 Serhiy _ CLA 2021-08-29 11:14:35 EDT
It's no longer reproducible on 2021-06 it seems. I've installed it separately and it worked. Then I've updated my main installation by re-enabling http://download.eclipse.org/releases/latest repo and it's no longer freezing. Not planning to test flatpak as I don't have flatpak installed. I had problems on X11. Wayland was fine in this regard but had fonts quite broken like this https://cdn.discordapp.com/attachments/571406924587728924/574626870339960833/unknown.png Haven't tested the new version with Wayland yet.