Bug 356746

Summary: ECJ accepts illegal unicode escape sequences
Product: [Eclipse Project] JDT
Reporter: Andreas Kohn <andreas.kohn>
Component: Core
Assignee: Olivier Thomann <Olivier_Thomann>
Status: VERIFIED FIXED
QA Contact:
Severity: normal
Priority:
CC: amj87.iitr, kazm, Olivier_Thomann, stephan202
Version: 3.7
Target Milestone: 3.8 M2
Hardware:
OS: Linux
Whiteboard:

Attachments:

Description	Flags
Input file	none
Input file (binary)	none
Proposed fix	none

Description Andreas Kohn

2011-09-05 13:50:54 EDT

Build Identifier: I20110803-1800

The following snippet compiles with ECJ (Eclipse Compiler for Java(TM) 0.C02, 3.8.0 M1, Copyright IBM Corp 2000, 2011. All rights reserved.), but is rejected with an error by Oracle javac (javac 1.7.0_02-ea):

public class Test {
	public static final String ERROR = "\u000Ⅻ";
}


Verbose output of Oracle javac:
---
[parsing started RegularFileObject[src/Test.java]]
src/Test.java:32: error: illegal unicode escape
	public static final String ERROR = "\u000Ⅻ";
	                                         ^
[parsing completed 18ms]
[total 45ms]
1 error
---

Verbose output of ECJ:
---
[parsing    src/Test.java - #1/1]
[reading    java/lang/Object.class]
[analyzing  src/Test.java - #1/1]
[reading    java/lang/String.class]
[writing    Test.class - #1]
[completed  src/Test.java - #1/1]
[1 unit compiled]
[1 .class file generated]
---

Reproducible: Always

Comment 1 Andreas Kohn

2011-09-05 13:57:45 EDT

Created attachment 202771 [details]
Input file

Attaching the actual input file.

The bug here (in my understanding) is that the JLS explicitly says which characters are allowed in a unicode escape:

--- http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.3
3.3 Unicode Escapes
Implementations first recognize Unicode escapes in their input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) with the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters:


    UnicodeInputCharacter:
            UnicodeEscape
            RawInputCharacter

    UnicodeEscape:
            \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

    UnicodeMarker:
            u
            UnicodeMarker u

    RawInputCharacter:
            any Unicode character

    HexDigit: one of
            0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.
---
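
For illustration, the grammar quoted above can be sketched as a small decoder that accepts only the 22 ASCII hex digits. This is not ECJ's scanner code; the class and method names (UnicodeEscapeCheck, asciiHexValue, decodeEscape) are hypothetical and exist only for this example.

```java
public class UnicodeEscapeCheck {
    // Returns 0..15 for one of the ASCII characters listed under HexDigit,
    // or -1 for any other character (including non-ASCII "numeric" characters).
    static int asciiHexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes a string that consists of exactly one escape: a backslash,
    // one or more 'u' characters, then four hex digits.
    // Returns the UTF-16 code unit, or -1 if the string is not well formed.
    static int decodeEscape(String s) {
        if (s.length() < 6 || s.charAt(0) != '\\' || s.charAt(1) != 'u') return -1;
        int i = 1;
        while (i < s.length() && s.charAt(i) == 'u') i++; // skip the UnicodeMarker
        if (s.length() - i != 4) return -1;               // exactly four digits must remain
        int value = 0;
        for (; i < s.length(); i++) {
            int d = asciiHexValue(s.charAt(i));
            if (d < 0) return -1;                         // illegal unicode escape
            value = (value << 4) | d;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(decodeEscape("\\u0041"));       // well-formed escape
        System.out.println(decodeEscape("\\uu0041"));      // repeated marker is allowed
        System.out.println(decodeEscape("\\u000\u216B"));  // last "digit" is U+216B
    }
}
```

With a check like this, the escape from the report is rejected because its fourth digit is not an ASCII character.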

Comment 2 Olivier Thomann

2011-09-05 14:40:40 EDT

Could you please attach the test case in binary format?
When I tried to compile it, I got:
c:\tests_sources>java -jar ecj-head.jar Test.java
----------
1. ERROR in Test.java (at line 32)
        public static final String ERROR = "\u000Ôà½";
                                            ^^^^^^
Invalid unicode
----------
1 problem (1 error)

Are you using a specific encoding?

Comment 3 Andreas Kohn

2011-09-06 02:09:00 EDT

Created attachment 202780 [details]
Input file (binary)

The file was UTF-8 encoded, which is the default on my system (LANG=en_US.utf8 is set in the environment, and Eclipse etc. are configured to use it).

The interesting character in the file is U+216B ROMAN NUMERAL TWELVE (which has the numeric value 12 according to Character#getNumericValue())
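
A quick way to see how the JDK classifies this character is sketched below. As far as the java.lang.Character documentation goes, U+216B is a letter number (Nl), so Character.isDigit() is expected to be false for it, while Character.getNumericValue() yields 12. A scanner that relied on the numeric value alone would therefore read it as the hex digit C; that this is what ECJ did is an assumption, since the thread does not show the patch. The class and method names are hypothetical.

```java
public class RomanNumeralTwelve {
    // Lenient check: accepts any character whose Unicode numeric value is 0..15.
    static boolean lenientHexDigit(char c) {
        int v = Character.getNumericValue(c);
        return v >= 0 && v <= 15;
    }

    // Strict check: only the ASCII characters 0-9, a-f and A-F.
    // The c < 128 guard is needed because Character.digit also accepts
    // non-ASCII decimal digits and fullwidth letters.
    static boolean strictHexDigit(char c) {
        return c < 128 && Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        char c = '\u216B'; // ROMAN NUMERAL TWELVE
        System.out.println("isDigit:         " + Character.isDigit(c));
        System.out.println("getNumericValue: " + Character.getNumericValue(c));
        System.out.println("digit(c, 16):    " + Character.digit(c, 16));
        System.out.println("lenientHexDigit: " + lenientHexDigit(c));
        System.out.println("strictHexDigit:  " + strictHexDigit(c));
    }
}
```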

Comment 4 Olivier Thomann

2011-09-06 08:18:52 EDT

Reproduced. I needed to pass -encoding UTF-8 to reproduce the issue.
Fix is trivial.

Comment 5 Olivier Thomann

2011-09-08 10:19:44 EDT

Created attachment 202994 [details]
Proposed fix

Comment 6 Olivier Thomann

2011-09-08 10:19:59 EDT

Released for 3.8M2.

Comment 7 Ayushman Jain

2011-09-12 17:31:26 EDT

Verified for 3.8M2 using org.eclipse.jdt.core_3.8.0.v_C09.jar