Bug 356746

Summary: ECJ accepts illegal unicode escape sequences
Product: [Eclipse Project] JDT
Reporter: Andreas Kohn <andreas.kohn>
Component: Core
Assignee: Olivier Thomann <Olivier_Thomann>
Status: VERIFIED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: amj87.iitr, kazm, Olivier_Thomann, stephan202
Version: 3.7
Target Milestone: 3.8 M2
Hardware: PC
OS: Linux
Whiteboard:
Attachments:
    Description            Flags
    Input file             none
    Input file (binary)    none
    Proposed fix           none

Description Andreas Kohn CLA 2011-09-05 13:50:54 EDT
Build Identifier: I20110803-1800

The following snippet compiles with ECJ (Eclipse Compiler for Java(TM) 0.C02, 3.8.0 M1, Copyright IBM Corp 2000, 2011. All rights reserved.), but is rejected with an error by Oracle javac (1.7.0_02-ea):

public class Test {
	public static final String ERROR = "\u000Ⅻ";
}


Verbose output of Oracle javac:
---
[parsing started RegularFileObject[src/Test.java]]
src/Test.java:32: error: illegal unicode escape
	public static final String ERROR = "\u000Ⅻ";
	                                         ^
[parsing completed 18ms]
[total 45ms]
1 error
---

Verbose output of ECJ:
---
[parsing    src/Test.java - #1/1]
[reading    java/lang/Object.class]
[analyzing  src/Test.java - #1/1]
[reading    java/lang/String.class]
[writing    Test.class - #1]
[completed  src/Test.java - #1/1]
[1 unit compiled]
[1 .class file generated]
---

Reproducible: Always
Comment 1 Andreas Kohn CLA 2011-09-05 13:57:45 EDT
Created attachment 202771 [details]
Input file

Attaching the actual input file.

As I understand it, the bug is that ECJ accepts a character that the JLS does not allow here. The JLS explicitly lists which characters may appear in a Unicode escape:

--- http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.3
3.3 Unicode Escapes
Implementations first recognize Unicode escapes in their input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) with the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters:


    UnicodeInputCharacter:
            UnicodeEscape
            RawInputCharacter

    UnicodeEscape:
            \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

    UnicodeMarker:
            u
            UnicodeMarker u

    RawInputCharacter:
            any Unicode character

    HexDigit: one of
            0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.
---
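
The grammar quoted above can be turned into a small check. The following is a minimal sketch, not ECJ's actual scanner code; the class and method names (StrictUnicodeEscape, asciiHexValue, decode) are made up for illustration. It accepts only the ASCII characters listed under HexDigit, so any other character in a digit position makes the escape invalid.

```java
// Hypothetical helper for illustration only; not taken from ECJ or javac.
final class StrictUnicodeEscape {

    /** Returns 0..15 for an ASCII hex digit, or -1 for any other character. */
    static int asciiHexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /**
     * Decodes one escape: a backslash, one or more 'u', then exactly four hex digits.
     * Returns the UTF-16 code unit, or -1 if the text is not a well-formed escape.
     */
    static int decode(String text) {
        if (text.length() < 6 || text.charAt(0) != '\\' || text.charAt(1) != 'u') {
            return -1;
        }
        int i = 1;
        while (i < text.length() && text.charAt(i) == 'u') {
            i++; // UnicodeMarker may repeat: u, uu, uuu, ...
        }
        if (text.length() - i != 4) {
            return -1; // exactly four digit positions must remain
        }
        int value = 0;
        for (; i < text.length(); i++) {
            int digit = asciiHexValue(text.charAt(i));
            if (digit < 0) {
                return -1; // not one of 0-9, a-f, A-F
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0041"));                 // 65
        System.out.println(decode("\\uu0041"));                // 65
        System.out.println(decode("\\u000" + (char) 0x216B));  // -1
    }
}
```

The last call mirrors the test case from this report: three ASCII zeros followed by U+216B, which the strict check rejects.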
Comment 2 Olivier Thomann CLA 2011-09-05 14:40:40 EDT
Could you please attach the test case in binary format?
When I tried to compile it, I got:
c:\tests_sources>java -jar ecj-head.jar Test.java
----------
1. ERROR in Test.java (at line 32)
        public static final String ERROR = "\u000Ôà½";
                                            ^^^^^^
Invalid unicode
----------
1 problem (1 error)

Are you using a specific encoding?
Comment 3 Andreas Kohn CLA 2011-09-06 02:09:00 EDT
Created attachment 202780 [details]
Input file (binary)

The file was UTF-8 encoded, which is the default on my system (LANG=en_US.utf8 is set in the environment, and Eclipse etc. are configured to use it).

The interesting character in the file is U+216B ROMAN NUMERAL TWELVE (a letter number, category Nl, for which Character#getNumericValue() returns 12; Character#isDigit() itself returns false for it).
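
A small check of how the standard java.lang.Character methods classify U+216B may help here. This is only a sketch; the class name RomanNumeralTwelve is made up, and which of these methods ECJ's scanner relied on is not stated in this report.

```java
// Illustration only: shows what the JDK reports for U+216B ROMAN NUMERAL TWELVE.
final class RomanNumeralTwelve {

    static final char TWELVE = (char) 0x216B;

    public static void main(String[] args) {
        // Numeric value assigned by Unicode to the roman numeral.
        System.out.println(Character.getNumericValue(TWELVE)); // 12

        // Radix-aware lookup: only decimal digits and Latin letters qualify.
        System.out.println(Character.digit(TWELVE, 16)); // -1

        // isDigit() is true only for category Nd (decimal digit number).
        System.out.println(Character.isDigit(TWELVE)); // false

        // The character's general category is Nl (letter number).
        System.out.println(Character.getType(TWELVE) == Character.LETTER_NUMBER); // true
    }
}
```

A validity test of the form "numeric value between 0 and 15" would therefore let this character through, while a radix-16 lookup or an explicit ASCII range check would not.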
Comment 4 Olivier Thomann CLA 2011-09-06 08:18:52 EDT
Reproduced. I needed to pass -encoding UTF-8 to reproduce the issue.
Fix is trivial.
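
A plausible reason the encoding matters, sketched under an assumption: the three UTF-8 bytes of U+216B only become that single character when the file is decoded as UTF-8. With a single-byte charset they turn into three unrelated characters, and the first one has no numeric value, so the escape is already reported as invalid. ISO-8859-1 is used below purely as a stand-in for a single-byte platform default; the actual default on the machine in comment 2 is not stated. The class name EncodingDemo is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: decodes the same bytes with two different charsets.
final class EncodingDemo {

    // UTF-8 encoding of U+216B ROMAN NUMERAL TWELVE.
    static final byte[] UTF8_BYTES = {(byte) 0xE2, (byte) 0x85, (byte) 0xAB};

    static String decode(Charset charset) {
        return new String(UTF8_BYTES, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(StandardCharsets.UTF_8);
        String asLatin1 = decode(StandardCharsets.ISO_8859_1); // stand-in single-byte charset

        // One character with numeric value 12.
        System.out.println(asUtf8.length() + " char(s), numeric value of first: "
                + Character.getNumericValue(asUtf8.charAt(0)));

        // Three characters; the first (U+00E2) has no numeric value.
        System.out.println(asLatin1.length() + " char(s), numeric value of first: "
                + Character.getNumericValue(asLatin1.charAt(0)));
    }
}
```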
Comment 5 Olivier Thomann CLA 2011-09-08 10:19:44 EDT
Created attachment 202994 [details]
Proposed fix
Comment 6 Olivier Thomann CLA 2011-09-08 10:19:59 EDT
Released for 3.8M2.
Comment 7 Ayushman Jain CLA 2011-09-12 17:31:26 EDT
Verified for 3.8M2 using org.eclipse.jdt.core_3.8.0.v_C09.jar