Bug 569853 - jstack occasionally crashes Eclipse when running on Java 11
Summary: jstack occasionally crashes Eclipse when running on Java 11
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: SWT (show other bugs)
Version: 4.15   Edit
Hardware: PC Linux
: P3 critical (vote)
Target Milestone: 4.19 M1   Edit
Assignee: Andrey Loskutov CLA
QA Contact:
URL:
Whiteboard:
Keywords: regression
: 569757 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-12-21 04:27 EST by Simeon Andreev CLA
Modified: 2022-03-10 11:43 EST (History)
9 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Simeon Andreev CLA 2020-12-21 04:27:14 EST
See https://bugs.eclipse.org/bugs/show_bug.cgi?id=563990#c3 and https://bugzilla.redhat.com/show_bug.cgi?id=1897150.

The first SWT commit I can reproduce the crash with is the fix for bug 540060: https://git.eclipse.org/c/platform/eclipse.platform.swt.git/commit/?id=558be9a19c6de3c914ccbed0ac541d5c849bf1f5

To clarify which part of the native code causes the crash and whether more is broken than "just" native threads.
Comment 1 Simeon Andreev CLA 2020-12-21 04:34:20 EST
Crash looks like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ffff79b5e90, pid=4895, tid=4910
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.8+10) (build 11.0.8+10-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.8+10-LTS, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libpthread.so.0+0xce90]  pthread_getcpuclockid+0x0
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h" (or dumping to /home/sandreev/eclipses/eclipse4_14SDK/core.4895)
#
# An error report file with more information is saved as:
# /home/sandreev/eclipses/eclipse4_14SDK/hs_err_pid4895.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Comment 2 Andrey Loskutov CLA 2020-12-21 05:50:29 EST
See also bug 569757, probably related or same.

Common thing is that the crash occurs at same native code (here the crash dump from crash on jstack, I have no JMC dump):

Stack: [0x00007fffb70a4000,0x00007fffb71a4000],  sp=0x00007fffb71a28f8,  free space=1018k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libpthread.so.0+0xce90]  pthread_getcpuclockid+0x0
V  [libjvm.so+0xe8a751]  Thread::print_on(outputStream*, bool) const+0x51
V  [libjvm.so+0xe8cfc6]  JavaThread::print_on(outputStream*, bool) const+0xe6
V  [libjvm.so+0xe8fd18]  Threads::print_on(outputStream*, bool, bool, bool, bool)+0x668
V  [libjvm.so+0xf04cd0]  VM_Operation::evaluate()+0xe0
V  [libjvm.so+0xf0269f]  VMThread::evaluate_operation(VM_Operation*)+0x11f
V  [libjvm.so+0xf02af5]  VMThread::loop()+0x265
V  [libjvm.so+0xf0302c]  VMThread::run()+0x7c
V  [libjvm.so+0xe90fe5]  Thread::call_run()+0x155
V  [libjvm.so+0xc1a878]  thread_native_entry(Thread*)+0xf8

In a similar JDK bug https://bugs.openjdk.java.net/browse/JDK-8206954 was a comment about how the crash could happen: https://bugs.openjdk.java.net/browse/JDK-8206954?focusedCommentId=14194417&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14194417 :

"The errant native thread in the new test, whilst still freshly terminated triggers the ESRCH and the new test passes cleanly. But the JavaThread for that will stay around until the VM terminates - as that's what happens if you fail to detach. So when the ThreadPriorities test runs and issues a jstack, that will still find the terminated but still attached thread from the earlier test. Only now the pthread_t points to umapped memory and so the validity check in pthread_getcpuclockid gets a SEGV."

Looking on SWT code, we only have AttachCurrentThread calls in eclipse.platform.swt/                                                                                                                                                                                
bundles/org.eclipse.swt/Eclipse SWT/common/library/callback.c

So could it be, the new GDBus based code allows webkit / gdbus to use some callback in SWT but that callback *never* detaches the native thread from JVM?

Could it be, new patch uncovered existing issue with org.eclipse.swt/Eclipse SWT/common/library/callback.c (that is probably used here)?

Disclaimer: I have absolutely no knowledge in JNI API used, so below are very naive questions I have looking on related SWT code.

In 
https://git.eclipse.org/c/platform/eclipse.platform.swt.git/tree/bundles/org.eclipse.swt/Eclipse%20SWT/common/library/callback.c?id=558be9a19c6de3c914ccbed0ac541d5c849bf1f5#n1200

I see:

#ifdef JNI_VERSION_1_4
	if (env == NULL) {
		if (JNI_VERSION >= JNI_VERSION_1_4) {
			(*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
		}
	}
#endif
	
	if (env == NULL) {
		(*jvm)->AttachCurrentThread(jvm, (void **)&env, NULL);
		if (IS_JNI_1_2) detach = 1;
	}

Does it mean, we call *both* AttachCurrentThreadAsDaemon and AttachCurrentThread if JNI_VERSION_1_4 is defined? Is this expected / good thing to do? Why it is not enough to AttachCurrentThreadAsDaemon only? Shouldn't we just remove AttachCurrentThread call and if (IS_JNI_1_2) check, because we don't support JDK < 8 anyway?

So shouldn't the block above be simply that:

if (env == NULL) {
	(*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
	detach = 1;
}

The only documentation I found is https://docs.oracle.com/en/java/javase/11/docs/specs/jni/invocation.html#attachcurrentthreadasdaemon

Also IS_JNI_1_2 is only ever set to 1 in JNI_OnLoad function - could it be, that is "not visible yet" for the webkit/gdbus thread code at the moment where AttachCurrentThread is called, so detach is not set yet to 1 (because IS_JNI_1_2 is still zero for the current thread) and DetachCurrentThread is never called?

And finally: let assume some exception happens with (webkit/gdbus) callback after it calls AttachCurrentThread* - would it mean, the DetachCurrentThread is not called? Is there a way to do something like "finally" in C code?
Comment 3 Simeon Andreev CLA 2020-12-21 06:24:25 EST
See change below.

With only the prints, I see a single call to AttachCurrentThreadAsDaemon without a matching detach call (Eclipse has done nothing, other than show the welcome HTML). I can reproduce the jstack crash almost instantly (usually on the first call).

If I add a matching detach call (also see change), I can call jstack a few times and no crashes occur. I don't know if the problem is gone as its sporadic, but on first glance maybe its gone).

diff --git a/bundles/org.eclipse.swt/Eclipse SWT/common/library/callback.c b/bundles/org.eclipse.swt/Eclipse SWT/common/library/callback.c
index 87b46e9caf..8b98bd2889 100644
--- a/bundles/org.eclipse.swt/Eclipse SWT/common/library/callback.c     
+++ b/bundles/org.eclipse.swt/Eclipse SWT/common/library/callback.c     
@@ -17,6 +17,9 @@
  */
 #include "callback.h"
 #include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
 
 #ifndef CALLBACK_NATIVE
 #define CALLBACK_NATIVE(func) Java_org_eclipse_swt_internal_Callback_##func
@@ -24,7 +27,6 @@
 
 /* define this to print out debug statements */
 /* #define DEBUG_CALL_PRINTS */
-
 /* --------------- callback globals ----------------- */
 
 static JavaVM *jvm = NULL;
@@ -1193,12 +1195,15 @@ jintLong callback(int index, ...)
        if (env == NULL) {
                if (JNI_VERSION >= JNI_VERSION_1_4) {
                        (*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
+                       fprintf(stderr, "AttachCurrentThreadAsDaemon %d\n", pthread_self());
+                       if (IS_JNI_1_2) detach = 1;
                }
        }
 #endif
        
        if (env == NULL) {
                (*jvm)->AttachCurrentThread(jvm, (void **)&env, NULL);
+               fprintf(stderr, "AttachCurrentThread %d\n", pthread_self());
                if (IS_JNI_1_2) detach = 1;
        }
        
@@ -1264,6 +1269,7 @@ done:
 
        if (detach) {
                (*jvm)->DetachCurrentThread(jvm);
+               fprintf(stderr, "DetachCurrentThread %d\n", pthread_self());
                env = NULL;
        }
Comment 4 Andrey Loskutov CLA 2020-12-21 06:42:40 EST
@Simeon: please provide gerrit with the change, we can continue discussion there.

Have you tried to replace the code block mentioned in comment 2 (both AttachCurrentThreadAsDaemon & AttachCurrentThread) with only this:

if (env == NULL) {
	(*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
	detach = 1;
}
Comment 5 Eclipse Genie CLA 2020-12-21 06:53:28 EST
New Gerrit change created: https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023
Comment 6 Simeon Andreev CLA 2020-12-21 07:01:34 EST
(In reply to Andrey Loskutov from comment #4)
> Have you tried to replace the code block mentioned in comment 2 (both
> AttachCurrentThreadAsDaemon & AttachCurrentThread) with only this:
> 
> if (env == NULL) {
> 	(*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
> 	detach = 1;
> }

No. The code is from before 2009, I don't know why the daemon attach is preferred for JNI 1.4. I'm guessing we could remove the handling for JNI 1.2, but I wouldn't touch it (I don't see why).

The Oracle JDK documentation specifies the versions as:

(
https://docs.oracle.com/javase/9/docs/specs/jni/functions.html#version-constants)

In JDK/JRE 1.1, GetVersion() returns JNI_VERSION_1_1.
In JDK/JRE 1.2, GetVersion() returns JNI_VERSION_1_2.
In JDK/JRE 1.4, GetVersion() returns JNI_VERSION_1_4.
In JDK/JRE 1.6, GetVersion() returns JNI_VERSION_1_6.
In JDK/JRE 1.8, GetVersion() returns JNI_VERSION_1_8.
In JDK/JRE 9, GetVersion() returns JNI_VERSION_9.
Comment 7 Andrey Loskutov CLA 2020-12-21 10:05:54 EST
(In reply to Eclipse Genie from comment #5)
> New Gerrit change created:
> https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023

While discussing the bug internally with Simeon, we've found the root cause for "missing" detach code, see bug 262985 comment 4.

The AttachCurrentThreadAsDaemon code was added to fix some Mac specific debugger issue, and it *doesn't want* that DetachCurrentThread is called!!!

Since I'm not a C developer, I haven't noticed before that the AttachCurrentThreadAsDaemon call *writes* to the given "env" pointer, so the second if block is simply *never* executed:

#ifdef JNI_VERSION_1_4
	if (env == NULL) {
		if (JNI_VERSION >= JNI_VERSION_1_4) {
			(*jvm)->AttachCurrentThreadAsDaemon(jvm, (void **)&env, NULL);
		}
	}
#endif
	

///// the env os NOT NULl here, so "detach" is never set to 1!

	if (env == NULL) {
		(*jvm)->AttachCurrentThread(jvm, (void **)&env, NULL);
		if (IS_JNI_1_2) detach = 1;
	}

So the current state of SWT code is: on every platform the call AttachCurrentThreadAsDaemon is executed without ever calling DetachCurrentThread afterwards, because of the Mac specific fix for bug 262985 in Eclipse 3.5.

I have no Mac and I wonder if someone on the CC could test the proposed patch on Mac:
https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023

The key changes:

1) Always call DetachCurrentThread. That is a main change, I have no idea if that would break now anything on Mac (bug 262985 comment 4 was reported for a pretty old version of JDK/Mac). As of today, without the patch, DetachCurrentThread is never called even if we always call AttachCurrentThreadAsDaemon. Because of this, and because of the changes in Java 9+ the code on Linux JVM port that uses pthread_getcpuclockid crashes while using obsolete native thread reference.

2) The code in the JNI_OnLoad callback returns now always JNI_VERSION_1_4 :
https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023/2/bundles/org.eclipse.swt/Eclipse+SWT/common/library/swt.c#19

as specified in: https://docs.oracle.com/javase/9/docs/specs/jni/invocation.html#jni_onload
https://docs.oracle.com/en/java/javase/11/docs/specs/jni/invocation.html#jni_onload

"In order to make use of functions defined at a certain version of the JNI API, JNI_OnLoad must return a constant defining at least that version. For example, libraries wishing to use AttachCurrentThreadAsDaemon function introduced in JDK 1.4, need to return at least JNI_VERSION_1_4".

Interestingly, this is NOT mentioned in the Java 8 version:
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html#JNJI_OnLoad

So from jstack point of view the native thread that is attached to the JVM should always detach itself - and if it doesn't and the native thread terminates, jstack will happily crash. Unfortunately JNI specs above do not say "detaching is mandatory", but of course it makes sense. Also it makes sense that a native application thread might be terminated without a chance to be detached from JVM, and JVM shouldn't be shocked by that.

On our RH issue workaround was proposed to switch UseLinuxPosixThreadCPUClocks flag off via -XX:-UseLinuxPosixThreadCPUClocks . That should avoid that jstack uses pthread_getcpuclockid API and reads thread data from the /proc system (which is 10x slower if we believe this post: https://blog.packagecloud.io/eng/2017/03/14/using-strace-to-understand-java-performance-improvement/).

This is of course not acceptable for us (Advantest). We should either fix Eclipse or JVM or both. Ideally JVM should be fixed, because SWT code in question is not the only native code that may "forget" about DetachCurrentThread.

But looking on the progress of https://bugzilla.redhat.com/show_bug.cgi?id=1897150 (that was reported as https://bugzilla.redhat.com/show_bug.cgi?id=1832121 more than a half year ago), I would prefer we would *also* fix SWT.

For SWT developers on CC: 

1) Could we please check proposed change https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023 ?
2) Could anyone test it on Mac & Windows (we tried on Linux RHEL 7.4 and it seem to work fine, but we did not run all the SDK tests)?
3) In case we won't get answers for 2), should we restrict patch for Linux?
4) Should we change the patch & provide a system flag for SWT so we can enable it on demand only for broader testing?
Comment 8 Simeon Andreev CLA 2020-12-21 10:13:08 EST
(In reply to Andrey Loskutov from comment #7)
> Unfortunately JNI specs above do not
> say "detaching is mandatory", but of course it makes sense.

This is actually specified here:

https://docs.oracle.com/javase/9/docs/specs/jni/invocation.html#detaching-from-the-vm

> Detaching from the VM
>
> A native thread attached to the VM must call DetachCurrentThread() to detach 
> itself before exiting. A thread cannot detach itself if there are Java methods on 
> the call stack.


This might be "new" documentation though, I don't know.

For completeness, here is the paragraph for attaching:

https://docs.oracle.com/javase/9/docs/specs/jni/invocation.html#attaching-to-the-vm
Comment 9 Andrey Loskutov CLA 2020-12-21 11:14:10 EST
Bug 264535 comment 16 has a small snippet to validate the original debugger issue on Mac. 
Sravan, could you please check on Mac if the patch https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023 causes the regression in debugger? 

If not, we would not need another complicated patch / extra native thread death listeners etc and could go with the proposed simple patch.
Comment 10 Sravan Kumar Lakkimsetti CLA 2020-12-21 12:21:10 EST
(In reply to Andrey Loskutov from comment #9)
> Bug 264535 comment 16 has a small snippet to validate the original debugger
> issue on Mac. 
> Sravan, could you please check on Mac if the patch
> https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174023 causes
> the regression in debugger? 
> 
> If not, we would not need another complicated patch / extra native thread
> death listeners etc and could go with the proposed simple patch.

I don't see any regression with the proposed patch on Mac. In my opinion we can go ahead with the proposed patch (as soon as the review comments are taken care of.
Comment 12 Eclipse Genie CLA 2020-12-22 06:41:18 EST
New Gerrit change created: https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174052
Comment 13 Andrey Loskutov CLA 2020-12-22 06:47:55 EST
(In reply to Eclipse Genie from comment #12)
> New Gerrit change created:
> https://git.eclipse.org/r/c/platform/eclipse.platform.swt/+/174052

This fixes a critical issue that is actually unrelated to the patch itself, but to the #define DEBUG_CALL_PRINTS definition that was used in debug prints.

Change on the build script was required to compile callback.c with
#define DEBUG_CALL_PRINTS for debug output.

Turned out, this change compiles something that GTK code can't
understand & so it crashes with SIGSEGV on opening native file dialog.

# Problematic frame:
# C  [libglib-2.0.so.0+0x39757]  g_path_is_absolute+0x7

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.eclipse.swt.internal.gtk.GTK.gtk_file_chooser_set_current_folder(JJ)V+0
j  org.eclipse.swt.widgets.FileDialog.presetChooserDialog()V+599
j  org.eclipse.swt.widgets.FileDialog.openNativeChooserDialog()Ljava/lang/String;+104
j  org.eclipse.swt.widgets.FileDialog.open()Ljava/lang/String;+1
j  org.eclipse.ui.internal.ide.actions.OpenLocalFileAction.run()V+35
j  org.eclipse.ui.internal.ide.actions.OpenLocalFileAction.run(Lorg/eclipse/jface/action/IAction;)V+1

Reverting the build script it works again. I will open a new bug for that.
Comment 14 Andrey Loskutov CLA 2020-12-22 06:56:29 EST
(In reply to Andrey Loskutov from comment #13)
> Reverting the build script it works again. I will open a new bug for that.

Opened bug 569872.
Comment 16 Andrey Loskutov CLA 2021-01-04 07:08:25 EST
*** Bug 569757 has been marked as a duplicate of this bug. ***