Bug 13939 - DBCS: no error message to DBCS whitespace in java source
Summary: DBCS: no error message to DBCS whitespace in java source
Status: CLOSED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 2.0   Edit
Hardware: PC Windows All
: P2 major (vote)
Target Milestone: 2.0 M6   Edit
Assignee: Olivier Thomann CLA
QA Contact:
URL:
Whiteboard:
Keywords: nl
Depends on:
Blocks:
 
Reported: 2002-04-16 16:11 EDT by Masayuki Fuse CLA
Modified: 2002-10-21 00:26 EDT (History)
4 users (show)

See Also:


Attachments
Test case java appl. zip file that includes DBCS blank for a field (493 bytes, application/octet-stream)
2002-04-17 16:03 EDT, Masayuki Fuse CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Masayuki Fuse CLA 2002-04-16 16:11:00 EDT
if java source includes invalid character in field or method name, no error 
message was shown and even if run the application. The invalid character error 
was shown at javadoc generation. I'd recommend that the error should be shown 
at the editor on the fly. 

STEPS
1) name for a field with DBCS blank in a java application
2) run the application is fine
3) in javadoc wizard for the program, invalid character message was shown 
during javadoc generation.
Comment 1 Nick Edgar CLA 2002-04-17 15:19:07 EDT
Which encoding is being used to store the file?
If you close and reopen the file does the character still appear correctly?
If so, it sounds like the Javadoc tool does not use the correct encoding.
Comment 2 Masayuki Fuse CLA 2002-04-17 16:00:12 EDT
I'm using Japanese MS932 encoding.
When I copied the java file from workspace and compiled by javac outside 
Eclipse, the compiler thrown an error message invalid character. I assume that 
Eclipse compiler filter out the invalid character, but not standalone javac and 
javadoc.
Comment 3 Masayuki Fuse CLA 2002-04-17 16:03:27 EDT
Created attachment 654 [details]
Test case java appl. zip file that includes DBCS blank for a field
Comment 4 Nick Edgar CLA 2002-04-18 14:13:21 EDT
Sounds like the parser should be more restrictive.
Since the file can be read OK, there must be extra checking going on in Javac 
and Javadoc.
Are there language rules for this?
Comment 5 Nick Edgar CLA 2002-04-18 21:55:18 EDT
Added comments from Fuse-San:

Here is information from Ken Borgendale who is working for i18n of Java. 

Java names must start with a char for which Character.isJavaIdentifierStart() 
returns true, and must continue with a char for which 
Character.isJavaItendifierPart() returns true.  The following is from the JDK 
1.4 documentation.  Note that JDK 1.4 is based on Unicode 3.0, and earlier 
versions are based on Unicode 2.1. 

A character may start a Java identifier if and only if one of the following 
conditions is true: 
isLetter(ch) returns true 
getType(ch) returns LETTER_NUMBER 
ch is a currency symbol (such as "$") 
ch is a connecting punctuation character (such as "_"). 
A character may be part of a Java identifier if any of the following are true: 
it is a letter 
it is a currency symbol (such as '$') 
it is a connecting punctuation character (such as '_') 
it is a digit 
it is a numeric letter (such as a Roman numeral character) 
it is a combining mark 
it is a non-spacing mark 
isIdentifierIgnorable returns true 

Comment 6 Olivier Thomann CLA 2002-04-19 10:36:30 EDT
\u3000 is the invalid character. If I compile this code:

public class Test4 {
	String 
\u30c6\u30ad\u3000\u30c8\u30b9;
}

I got this error:
----------
1. ERROR in 
D:\temp\Test4.java (at line 2)
	String \u30c6\u30ad\u3000\u30c8\u30b9;
	                         
^^^^^^^^^^^^
Syntax error on token "??", ";", "," expected
----------
Compiled 4 lines in 2113 ms (1.8 
lines/s)
1 problem (1 error)

So something might be wrong with the encoding option.
Comment 7 Olivier Thomann CLA 2002-04-19 10:53:11 EDT
The bug can be detected only when the \u3000 is the last character of an identifier. If you put it in 
the middle, the compiler fails properly. I am investigating why the last position causes the 
problem.
Comment 8 Olivier Thomann CLA 2002-04-19 11:50:58 EDT
It makes sense that we accept \u3000 at the end of a field name. This is a whitespace and you can put as 
many whitespaces you want at the end of a field name. This is perfectly legal.
Here is an 
example:
public class Test4 {
	public static void main(String[] args) 
{
		System.out.println(Character.isWhitespace('\u3000'));
	}
}
Use the following 
command line to convert this source into japanese characters.
native2ascii -reverse -encoding 
MS932 Test4.java Test5.java
Rename "Test4" in "Test5" in the Test5.java and then compile it 
with our compiler using the encoding MS932. when You run it, you got:
true

This means that 
\u3000 is seen as a whitespace and therefore can be at the end of a field name without causing a 
compile error. Javac fails to compile a source if the field name ends with \u3000.

\u3000 is a 
full width white space. Then I don't see anything wrong by adding such a character at the end of a 
field name.

The following test case returns false when we check that \u3000 is a valid part of a 
java identifier.
public class Test4 {
	public static void main(String[] args) 
{
		System.out.println(Character.isJavaIdentifierPart('\u3000'));
	}
}

Then 
the field name should stop at this character. The whitespace is consumed and the next token is the 
semi-colon. So the source is valid.
Looking at this bug report in the Sun database 
http://developer.java.sun.com/developer/bugParade/bugs/4660036.html, the \u3000 
should not be treated as a valid whitespace. But then all API in the Character class are boggus. We 
use them to determine if a identifier is a valid identifier.
This bug report shows 
inconsistency in the Sun code about what a whitespace is. 
http://developer.java.sun.com/developer/bugParade/bugs/4080617.html. They suggest to 
consider the result of the method Character.isWhitespace(char) to find out what a whitespace 
is.
Then this is not a bug and we can close this PR as INVALID.

Comment 9 Philipe Mulet CLA 2002-04-19 13:15:13 EDT
As per specs we are right, as long as the Character.isWhitespace is right.
Javac doesn't seem to be using it, and therefore isn't consistent.

Comment 10 Masayuki Fuse CLA 2002-04-19 14:15:09 EDT
I agree that DBCS blank should not use for Java identifier. However user 
sometimes erroneously input DBCS blank in any place. I'd recommend that Eclipse 
editor displays the error for the DBCS blank at free area.
I've just checked that VAJ throws an invalid character error when saving a 
file, if the DBCS blank is included. I'd prefer same behavior in eclipse.  
Comment 11 Olivier Thomann CLA 2002-04-19 14:31:21 EDT
Then I guess this is a platform UI problem. JDT/Core is responsible for compiling java source 
code. In this case this code is valid. As long as there is no consistency in the way the Character API 
is handling the \u3000 character we cannot improve our behavior for the compiler.
Comment 12 Nick Edgar CLA 2002-04-22 10:28:12 EDT
The choice of which characters the compiler considers to be whitespace or not 
is a compiler issue, and cannot be addressed by Platform UI, which does not 
even know about Java tooling.

To summarize the issues:
- all other known Java parsers (including Javac, Javadoc and JBuilder) consider 
the DBCS blank to be an invalid character
- we allow it
- according to the specs, our behaviour is the correct one
- it will be a fairly common occurrence for Japanese users (and possible users 
in other locales) to enter a DBCS blank instead of an ASCII space
- therefore, our tools will accept it but when the source is put through other 
tools like Javadoc, they will get errors
- since JDT's Javadoc generation tool hits this error, we look inconsistent 
since we accept it in one place but not another

Fuse-San characterizes this as a moderate usability problem (not severe, but 
not low either).

So I see 4 options:
1. make our parser work the same as others (and against the spec)
2. accept it, but generate a warning (either in the parser or at JDT UI level)
3. Fix our Javadoc (and any other tools) to consistently accept DBCS blank
4. do nothing and close this PR as WONTFIX.

Clearly option 1 is not favoured.
Moving to JDT UI for consideration of the other options.
Comment 13 Masayuki Fuse CLA 2002-04-25 15:42:34 EDT
should be fixed in 2.0 release or by its rollup.
Comment 14 Erich Gamma CLA 2002-08-06 06:36:00 EDT
Regarding option 2:
The editor shows the warnings/errors it gets from the reconciler/compiler. It 
is not doable to add another parsing pass in the editor to only detect this 
particular case and treat it as a warning. 

Therefore Option 2a: 
The compiler provides an option for whether this case should be a 
warning/error/ignored. This option could then be surfaced as a preference.

Regarding option 3:
we support to use any javadoc tool that comes with a JDK, so this isn't doable.

Moving to JDT CORE for commenting on 2a
Comment 15 Philipe Mulet CLA 2002-10-02 09:32:06 EDT
Apparently, we should not be using Character.isWhiteSpace, which implementation 
doesn't even follow its own spec (javadoc). The DBCS blank should indeed not be 
treated as a whitespace.

According to the mentionned defects from javac (see links from Olivier's 
comment), nothing will change from their side, they won't fix the 
implementation since it may break some existing user.

So, we will have to write our own whitespace check, and actually the JLS2 
doesn't advocate we use Character.isWhitespace (like it does for identifier 
parts).

We simply want the DBCS blank to be treated as '#', an invalid character in 
Java.

Proposed solution is to define Scanner.isWhiteSpace, and use it in place of 
Character.isWhiteSpace. Also we want to tune it to match competitors (the JLS 
is fairly restrictive on what it considers as a whitespace).
Comment 16 Olivier Thomann CLA 2002-10-02 10:51:51 EDT
It should not be an option to reject such a code. An illegal character is an
illegal character and it should not be an option. Our problem comes from the
fact that the isWhitespace(char) method on java.lang.Character doesn't follow
the specs. Then we failed because we rely on its result. I am investigating a
fix based on a new implementation of this method in the scanner class.
Comment 17 Olivier Thomann CLA 2002-10-03 09:21:39 EDT
THe DBCS blanks are now invalid characters. This is NOT optional.
Fixed and released in 2.1 stream.
Comment 18 Masayuki Fuse CLA 2002-10-21 00:26:56 EDT
Verified with the 2.1 M2.