Bug 534923 - objc_msgSend is called incorrectly from os.c in macOS SWT
Summary: objc_msgSend is called incorrectly from os.c in macOS SWT
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: SWT (show other bugs)
Version: 4.8   Edit
Hardware: Macintosh Mac OS X
: P3 normal (vote)
Target Milestone: 4.11 M3   Edit
Assignee: Thomas Wolf CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-21 23:40 EDT by Alexei Svitkine CLA
Modified: 2020-03-16 13:51 EDT (History)
6 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexei Svitkine CLA 2018-05-21 23:40:22 EDT
The vararg function objc_msgSend() is called incorrectly from os.c in macOS SWT.

In os.c, it is invoked by casting it to a specific function signature before calling it, for example:

rc = (jintLong)((jintLong (*)(jintLong, jintLong, jintLong, jintLong, jintLong))objc_msgSend)(arg0, arg1, arg2, arg3, arg4);

However, objc_msgSend is a vararg function and the calling convention between a vararg function and a non-vararg function is not guaranteed to be the same. In particular, on x86_64 it is different.

From the ABI spec (https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf):

"For calls that may call functions that use varargs or stdargs (prototype-less
calls or calls to functions containing ellipsis (. . . ) in the declaration) %al16 is used
as hidden argument to specify the number of vector registers used. The contents
of %al do not need to match exactly the number of registers, but must be an upper
bound on the number of vector registers used and is in the range 0–8 inclusive."

So, the above specifies that for vararg function calls, al (which is part of eax and part of  thus eax and thus rax) have a meaning. This same requirement does not exist for non-vararg calls.

This can be seen by compiling these two functions with GCC:

jintLong func1(void)
{
	jintLong a = 1, b = 2, c = 3, d = 4, e = 5;
	return (jintLong)objc_msgSend(a, b, c, d, e);
}

jintLong func2(void)
{
	jintLong a = 1, b = 2, c = 3, d = 4, e = 5;
	return (jintLong)((jintLong (*)(jintLong, jintLong, jintLong, jintLong, jintLong))objc_msgSend)(a, b, c, d, e);
}

If you look at their disassembly, you can see that it is different:

Dump of assembler code for function func1:
0x00000001006f72f0 <func1+0>:	push   %rbp
0x00000001006f72f1 <func1+1>:	mov    %rsp,%rbp
0x00000001006f72f4 <func1+4>:	mov    $0x5,%r8d
0x00000001006f72fa <func1+10>:	mov    $0x4,%ecx
0x00000001006f72ff <func1+15>:	mov    $0x3,%edx
0x00000001006f7304 <func1+20>:	mov    $0x2,%esi
0x00000001006f7309 <func1+25>:	mov    $0x1,%edi
0x00000001006f730e <func1+30>:	xor    %eax,%eax
0x00000001006f7310 <func1+32>:	leaveq 
0x00000001006f7311 <func1+33>:	jmpq   0x100791dc8 <dyld_stub_objc_msgSend>
End of assembler dump.

Dump of assembler code for function func2:
0x00000001006f72cc <func2+0>:	push   %rbp
0x00000001006f72cd <func2+1>:	mov    %rsp,%rbp
0x00000001006f72d0 <func2+4>:	mov    $0x5,%r8d
0x00000001006f72d6 <func2+10>:	mov    $0x4,%ecx
0x00000001006f72db <func2+15>:	mov    $0x3,%edx
0x00000001006f72e0 <func2+20>:	mov    $0x2,%esi
0x00000001006f72e5 <func2+25>:	mov    $0x1,%edi
0x00000001006f72ea <func2+30>:	leaveq 
0x00000001006f72eb <func2+31>:	jmpq   0x100791dc8 <dyld_stub_objc_msgSend>
End of assembler dump.

The difference is the extra "xor    %eax,%eax" call in func1. This is to clear the "al" register to specify that vector registers aren't being used. Without it, whatever value is left over in that register would be in effect, which can cause a crash.

The fix is to change os.c to not cast the objc_msgSend function and just call it normally, as is done in func1.
Comment 1 Thomas Wolf CLA 2019-02-01 06:21:07 EST
Is anybody looking at this? Calling conventions for C vararg functions in general may be different from non-vararg functions. That has "always" been that way; IIRC at least since ANSI C 89. Calling a vararg function through a non-vararg prototype has always been wrong. It may or may not work; the standard calls this "undefined behavior". See [1], p. 82: "If a function that accepts a variable number of arguments is defined without a parameter type list that ends with the ellipsis notation, the behavior is undefined."

I wonder if this might be the cause of various strange bugs that have been reported over the years with objc_msgSend and objc_msgSendSuper calls (typically causing Eclipse to hang), such as bug 485337, bug 476140, bug 483379, bug 511901, bug 418021, bug 486875, bug 443949, and possibly others.

[1] https://www.pdf-archive.com/2014/10/02/ansi-iso-9899-1990-1/ansi-iso-9899-1990-1.pdf
Comment 2 Andrey Loskutov CLA 2019-02-01 08:54:57 EST
Thomas, if you have a Mac, feel free to contribute a patch. There aren't so many SWT developers with Macs.
Comment 3 Thomas Wolf CLA 2019-02-02 12:20:49 EST
(In reply to Andrey Loskutov from comment #2)
> Thomas, if you have a Mac, feel free to contribute a patch. There aren't so
> many SWT developers with Macs.

Hmph. OK, I have a working patch to MacGenerator and NativesGenerator that generate these files (the Cocoa OS.java and os.c). The correct solution is to leave the casts in because objc_msgSend_stret and objc_msgSendSuper_stret must
be casted in any case. However, I changed the cast to a varargs prototype as
in

  ((jintLong (*)(jintLong, jintLong, ...)) objc_msgSend)(a, b, c, d, e)

That also produces the correct code including clearing %al, and does so also
for the *_stret calls.

Now, can someone tell me how the SWT Jenkins builds work? What do I have to push to Gerrit to have this work? Do I have to push the changed generator and the new generated OS.java and os.c? Or only the generator? And do I have to also push the new generated binary libraries? And what needs to pushed together or what comes after what?
Comment 4 Thomas Wolf CLA 2019-02-03 06:30:43 EST
Also, job eclipse.platform.swt-Gerrit seems to run on a CentOS node. How and where would these OS X-specific changes be built and be automatically tested at all? Which job builds the OS X jnilibs?
Comment 5 Lakshmi P Shanmugam CLA 2019-02-04 01:23:10 EST
Hi Thomas, 
Thanks for working on this!

(In reply to Thomas Wolf from comment #3)
>
> Now, can someone tell me how the SWT Jenkins builds work? What do I have to
> push to Gerrit to have this work? Do I have to push the changed generator
> and the new generated OS.java and os.c? Or only the generator? And do I have
> to also push the new generated binary libraries? And what needs to pushed
> together or what comes after what?
Please push the changes to the Generator and the generated OS.java and os.c files to gerrit in the same gerrit change.
You *don't* have to push the newly built binaries. They'll be built by the SWT build jobs (https://hudson.eclipse.org/releng/view/SWT%20Natives/) as part of the I-build. Specifically, https://hudson.eclipse.org/releng/view/SWT%20Natives/job/cocoa_x86_64/ builds the Cocoa native libraries.

(In reply to Thomas Wolf from comment #4)
> Also, job eclipse.platform.swt-Gerrit seems to run on a CentOS node. How and
> where would these OS X-specific changes be built and be automatically tested
> at all? Which job builds the OS X jnilibs?
Unfortunately, the Mac (and Windows) changes are not built or automatically tested as part of the Gerrit jobs. They'll be tested as part of the I-build after the changes are merged. Please run the tests locally to make sure they pass.
Comment 6 Eclipse Genie CLA 2019-02-04 02:41:57 EST
New Gerrit change created: https://git.eclipse.org/r/136214
Comment 7 Thomas Wolf CLA 2019-02-05 02:34:27 EST
BTW it's not just about clearing %al, but also about setting %al correctly when values _are_ passed in vector registers. (All objc_msgSend* calls that have NSPoint or NSRect arguments; in the C code the *_JJD* variants.) For those %al is not set unless a varargs prototype is used. With a varags prototype, %al is set correctly now.
Comment 9 Lakshmi P Shanmugam CLA 2019-02-05 05:48:48 EST
Thanks for the patch, Thomas!
Marking as fixed.
Comment 10 Lakshmi P Shanmugam CLA 2019-02-05 05:54:48 EST
Thanks Alexei for reporting this!
Comment 11 Lakshmi P Shanmugam CLA 2019-02-20 01:04:18 EST
Verified in I20190218-1800.
Comment 12 Nikita Nemkin CLA 2019-03-28 11:07:24 EDT
I'm sorry, but this change has to be reverted. It's based on an incorrect premise that functions of objc_msgSend family are variadic. They are NOT.

objc_msgSend perfectly forwards function arguments for arbitrary signatures. Its formal C declaration is not meant to be used directly. Instead, it should always be cast to whatever the target method signature is, exactly like the old code did before this change.

See the official documentation, section "Cast Objective-C Messages in Proper Form": https://developer.apple.com/documentation/uikit/core_app/updating_your_app_from_32-bit_to_64-bit_architecture/managing_functions_and_function_pointers?language=objc

Also, if you look at compiler generated msgSend calls, you'll see that they are not variadic (unless the target method itself is variadic).
Comment 13 Dani Megert CLA 2019-03-31 13:14:05 EDT
(In reply to Nikita Nemkin from comment #12)
You need to file a new bug as 4.11 is done and we don't want to change history.
Comment 14 Alexei Svitkine CLA 2019-03-31 13:48:13 EDT
Re: comment 12

I have observed that if objc_msgSend is called with a garbage value in eax (i.e. left over from some other code), it can indeed crash on x86_64, due to presumably using that value for its vararg meaning. The vararg calling convention does clear eax which solves the problem (whereas the casted version doesn't - see examples in the original issue).

So not sure about the linked docs - what I see in practice is this does matter. Although this may not happen with OpenJDK and those specific JNI calls because it seems OpenJDK might actually clear eax before invoking JNI and the os.c JNI functions don't currently clobber eax. But eax being cleared when invoking JNI is not guaranteed to be the case - for example Avian jvm does not clear it.
Comment 15 Nikita Nemkin CLA 2019-03-31 15:09:07 EDT
(In reply to Alexei Svitkine from comment #14)
> Re: comment 12

Can you demonstrate these crashes you're seing? What you're describing is a total breach of ABI rules and is very hard to believe. SWT worked just fine before this change.

ObjC compiler doesn't initialize rax when calling non-variadic methods: https://godbolt.org/z/PRx1M5, I see no reason why SWT should do so.
Comment 16 Alexei Svitkine CLA 2019-03-31 16:10:34 EDT
"Can you demonstrate these crashes you're seing?"

I can try to get a repro. It's in the context of a bigger project, so will try to isolate.

"What you're describing is a total breach of ABI rules and is very hard to believe."

Actually the ABI states that variadic functions should be setting eax according to the number of vector registers being passed. The discussion is around whether objc_msgSend() actually follows the ABI per its C signature.

"SWT worked just fine before this change."

I mentioned I was seeing the issues with Avian VM, not with other VMs (presumably because their JNI calling impls might already have been clearing eax?). I assume you've not actually tested SWT with Avian on x86_64, right? So this statement is not useful.

"ObjC compiler doesn't initialize rax when calling non-variadic methods: https://godbolt.org/z/PRx1M5, I see no reason why SWT should do so."

Thanks for the example, it does seem there's some inconsistency with the results I was seeing here. Let me see if I can get a small repro for what I was seeing. Perhaps its dependent on the signature of what's being called.
Comment 17 Alexei Svitkine CLA 2019-03-31 16:54:11 EDT
In the context of the project where I originally encountered the issue, I can longer reproduce the crashes I was seeing that were fixed when I modified my local copy of SWT's os.c to use variadic calling conventions.

However, the current build is using a newer version of Avian VM than what was being used when I was seeing those issues. So maybe something was changed in that VM that is responsible for the discrepancies.

So at this point, given that I can't repro the problems anymore and the evidence that Nikita provided that casting to a non-variadic signature is actually correct, it does seem like this should be reverted back. Sorry for the trouble. :(
Comment 18 Nikita Nemkin CLA 2019-03-31 16:57:59 EDT
(In reply to Alexei Svitkine from comment #16)

Wasn't Avian abandoned a long time ago? To be honest, I didn't expect Eclipse to work on Avian at all.
Comment 19 Alexei Svitkine CLA 2019-03-31 17:01:15 EDT
(In reply to Nikita Nemkin from comment #18)
> (In reply to Alexei Svitkine from comment #16)
> 
> Wasn't Avian abandoned a long time ago? To be honest, I didn't expect
> Eclipse to work on Avian at all.

To clarify, this was for SWT being used as a library in a different project, not all of Eclipse. It works just fine. :) It's nice to ship a Java-based app with 2MB of runtime instead of 200MB. :)
Comment 20 Alexei Svitkine CLA 2019-03-31 17:07:42 EDT
Actually, looks like I spoke too soon.

Had some user error in my testing and am actually seeing the crashes from before after reverting my manual os.c changes. Let's see if I could get a repro.
Comment 21 Alexei Svitkine CLA 2019-03-31 18:04:52 EDT
Okay, so looks like the offending call was +[NSString stringWithFormat:], which was being called via JNI objc_1msgSend__JJJJ on x86_64.

So the issue is more subtle than the original report. The original code was correct for most calls - but variadic calls themselves (like stringWithFormat) were being handled incorrectly.

So the right fix would be to revert the general case changes back to non-variadic calls, but introduce variadic calls for the cases that need them like stringWithFormat:.
Comment 22 Lakshmi P Shanmugam CLA 2019-04-09 01:27:14 EDT
@Nikita, 
https://git.eclipse.org/r/#/c/140108/ - only reverts the changes in this bug. 
But, according to comment#21, another fix is required for completion. Do you agree? Do you plan to provide the fix?
Comment 23 Nikita Nemkin CLA 2019-04-09 01:45:42 EDT
(In reply to Lakshmi Shanmugam from comment #22)
> @Nikita, 
> https://git.eclipse.org/r/#/c/140108/ - only reverts the changes in this
> bug. 
> But, according to comment#21, another fix is required for completion. Do you
> agree? Do you plan to provide the fix?

I plan to remove all NSString.stringWithFormat calls from SWT. There are only a few (in Display code). This should fix Alexei's crashes on Avian.

I don't plan to add variadic support to JNIGen.
Comment 24 Lakshmi P Shanmugam CLA 2019-04-16 01:45:11 EDT
(In reply to Nikita Nemkin from comment #23)
> (In reply to Lakshmi Shanmugam from comment #22)
> > @Nikita, 
> > https://git.eclipse.org/r/#/c/140108/ - only reverts the changes in this
> > bug. 
> > But, according to comment#21, another fix is required for completion. Do you
> > agree? Do you plan to provide the fix?
> 
> I plan to remove all NSString.stringWithFormat calls from SWT. There are
> only a few (in Display code). This should fix Alexei's crashes on Avian.
> 
> I don't plan to add variadic support to JNIGen.

Can you please provide a patch for removing NSString.stringWithFormat? I think we should first fix this before reverting the patch.

This would only fix the crash caused by NSString.stringWithFormat. I've not checked if there are other similar functions in use in SWT, but if there are any, then they would crash too after reverting the patch, right?
Comment 25 Nikita Nemkin CLA 2019-04-16 03:37:02 EDT
(In reply to Alexei Svitkine from comment #21)

Alexei, I figured out why you only get crashes on Avian. The code using stringWithFormat (Display.createMainMenu) never runs on Oracle Java and derivatives. It only runs when both the JVM and the app bundle lack DefaultApp.nib file, which SWT expects to provide the default menu.

(In reply to Lakshmi Shanmugam from comment #24)

SWT declares a few variadic ObjC functions, but never actually uses them. I don't expect any new crashes to surface. After all, Cocoa port did fine for 12 years without the "fix" that's being reverted.
Comment 26 Eclipse Genie CLA 2019-04-16 03:37:08 EDT
New Gerrit change created: https://git.eclipse.org/r/140652
Comment 27 Lakshmi P Shanmugam CLA 2019-04-19 04:14:34 EDT
(In reply to Nikita Nemkin from comment #25)
> (In reply to Lakshmi Shanmugam from comment #24)
> 
> SWT declares a few variadic ObjC functions, but never actually uses them. I
> don't expect any new crashes to surface. After all, Cocoa port did fine for
> 12 years without the "fix" that's being reverted.

Thanks for the patch. I didn't say or imply that new crashes would surface on reverting the patch.
The patch was created in the first place as a possible fix for the crashes/hang issues seen in Cocoa port in the past (see comment#1 for details). 
So, my question was to clarify whether calls to other variadic functions would crash after the patch was reverted.
Since the patch removes the variadic functions in use, I think there is no problem (atleast for now).
Comment 28 Lakshmi P Shanmugam CLA 2019-04-19 04:32:50 EDT
(In reply to Nikita Nemkin from comment #23)
> (In reply to Lakshmi Shanmugam from comment #22)
> > @Nikita, 
> > https://git.eclipse.org/r/#/c/140108/ - only reverts the changes in this
> > bug. 
> > But, according to comment#21, another fix is required for completion. Do you
> > agree? Do you plan to provide the fix?
> 
> I plan to remove all NSString.stringWithFormat calls from SWT. There are
> only a few (in Display code). This should fix Alexei's crashes on Avian.
> 
Opened Bug 546585 to track this.

(In reply to Eclipse Genie from comment #26)
> New Gerrit change created: https://git.eclipse.org/r/140652
Moved the above gerrit to Bug 546585.
Comment 29 Lakshmi P Shanmugam CLA 2019-04-19 04:40:56 EDT
(In reply to Alexei Svitkine from comment #21)
> Okay, so looks like the offending call was +[NSString stringWithFormat:],
> which was being called via JNI objc_1msgSend__JJJJ on x86_64.
> 
> So the issue is more subtle than the original report. The original code was
> correct for most calls - but variadic calls themselves (like
> stringWithFormat) were being handled incorrectly.
> 
> So the right fix would be to revert the general case changes back to
> non-variadic calls, but introduce variadic calls for the cases that need
> them like stringWithFormat:.

Opened Bug 546586 to fix the JNI Generator.
Comment 30 Lakshmi P Shanmugam CLA 2019-04-24 05:27:54 EDT
Nikita's patch [1] will revert the initial changes for this bug.  https://git.eclipse.org/r/140652 already removed the variadic function calls in SWT code, so reverting the initial changes should not cause any problems.

I plan to push this ([1]) today as there were no other concerns/objections.

[1] - https://git.eclipse.org/r/#/c/140108/
Comment 31 Lakshmi P Shanmugam CLA 2019-04-24 08:47:35 EDT
(In reply to Lakshmi Shanmugam from comment #30)
> Nikita's patch [1] will revert the initial changes for this bug. 
> https://git.eclipse.org/r/140652 already removed the variadic function calls
> in SWT code, so reverting the initial changes should not cause any problems.
> 
> I plan to push this ([1]) today as there were no other concerns/objections.
> 
> [1] - https://git.eclipse.org/r/#/c/140108/

Merged the revert patch - https://git.eclipse.org/r/#/c/140108/ to master. Revert is tracked by Bug 546144.