112223 – Scanner#getNextToken() behavior doesn't seem consistent if there is a Unicode escape inside a string.


Status:	VERIFIED FIXED

Alias:	None

Product:	JDT
Classification:	Eclipse Project
Component:	Core
Version:	3.2
Hardware:	PC Windows XP

Importance:	P3 normal
Target Milestone:	3.2 M3
Assignee:	Olivier Thomann

Reported:	2005-10-11 11:06 EDT by David Audel
Modified:	2006-04-14 13:08 EDT
CC List:	1 user


Description David Audel

2005-10-11 11:06:20 EDT

build I20050928-1300 + jdtcore head

In the method Scanner#getNextToken() at line 1212, the lookahead loop doesn't
behave the same way with "\u004Ca\n" as with "\u004Ca\n\u0022 (\u004C=L \u0022=")
and that seems strange. In both cases an InvalidInputException is thrown, but
currentPosition is not the same and I can't find why. The algorithm looks
strange, but I didn't find a visible bug.

Another question: why is this code reached only if the first character is a
Unicode escape?

---------------------------------------------------------------
if (isUnicode) {
  int start = this.currentPosition;
  for (int lookAhead = 0; lookAhead < 50; lookAhead++) {
    if (this.currentPosition >= this.eofPosition) {
      this.currentPosition = start;
      break;
    }
    if (((this.currentCharacter = this.source[this.currentPosition++]) == '\\')
        && (this.source[this.currentPosition] == 'u')) {
      isUnicode = true;
      getNextUnicodeChar();
    } else {
      isUnicode = false;
    }
    if (!isUnicode && this.currentCharacter == '\n') {
      this.currentPosition--; // set current position on new line character
      break;
    }
    if (this.currentCharacter == '\"') {
      throw new InvalidInputException(INVALID_CHAR_IN_STRING);
    }
  }
} else {
  this.currentPosition--; // set current position on new line character
}
throw new InvalidInputException(INVALID_CHAR_IN_STRING);

Comment 1 David Audel

2005-10-11 11:38:50 EDT

I found a visible bug. With "a\u000Da" the end of the error is just before 'D',
but if the string starts with a Unicode escape ("\u004Ca\u000Da") the end of
the error is at '"'.
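Both strings in this comment are invalid for the same reason: the Unicode escape for U+000D is translated to a carriage return before tokenization, so it acts as a line terminator inside the string literal. The sketch below is a simplified, self-contained decoder that makes this visible. It is not the JDT scanner code; the class and method names (UnicodeEscapeSketch, decode, hex4) are hypothetical, and malformed escapes are simply copied through instead of being reported.

```java
final class UnicodeEscapeSketch {

    // Parses four hex digits starting at 'from'; returns -1 if they are missing or invalid.
    static int hex4(String s, int from) {
        if (from + 4 > s.length()) return -1;
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) return -1;
            value = value * 16 + digit;
        }
        return value;
    }

    // Simplified translation of backslash-u escapes (assumption: malformed
    // escapes are left untouched, whereas a real compiler reports an error).
    static String decode(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous raw backslashes immediately before position i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean eligible = c == '\\' && backslashes % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') j++; // several 'u's are allowed
                int value = hex4(src, j);
                if (value >= 0) {
                    out.append((char) value);
                    i = j + 4;
                    backslashes = 0;
                    continue;
                }
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two string bodies from this comment, written with escaped backslashes.
        for (String body : new String[] {"a\\u000Da", "\\u004Ca\\u000Da"}) {
            String decoded = decode(body);
            System.out.println(body + " contains CR after decoding: " + (decoded.indexOf('\r') >= 0));
        }
    }
}
```

Since both bodies decode to text containing a carriage return, one would expect the scanner to end the error at the same relative place in both cases, whether or not the literal starts with an escape.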

Comment 2 Olivier Thomann

2005-10-11 13:42:39 EDT

In fact the problem is worse than that.
public class X {
	public static void main(String[] args) {
		System.out.println("\u004Ca\u000D");
	}
}

This reports a string literal not properly closed by a double quote, where I
would expect an error about an illegal character in a string literal.

Comment 3 Olivier Thomann

2005-10-11 14:07:23 EDT

In fact we convert the INVALID_CHAR_IN_STRING into an unterminated string
error. I would rather locate the invalid character in the string literal.
We could then get rid of the lookahead and report the error only against the
invalid character. When we highlight the whole string literal, we don't help
the user locate the error.
Philippe, any thoughts?

I do have a fix for the problem reported in comment 1.
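
The idea of reporting the error only against the invalid character can be illustrated with a small standalone sketch. It is not the fix that was released and does not use the JDT Scanner API; the class InvalidCharLocator and its locate method are hypothetical. It walks a string literal, decodes backslash-u escapes on the fly, and returns the source range of the first line terminator it meets, so the range covers the whole escape when the terminator was written as one. As a stated simplification, it ignores backslashes that are themselves written as Unicode escapes and skips malformed escapes.

```java
final class InvalidCharLocator {

    // Returns {start, endExclusive} of the first line terminator found inside the
    // string literal whose opening quote is at index 'open', or null if the
    // literal is closed (or the input ends) before any line terminator.
    static int[] locate(char[] source, int open) {
        int pos = open + 1;
        while (pos < source.length) {
            int start = pos;
            char c = source[pos++];
            if (c == '\\' && pos < source.length) {
                if (source[pos] != 'u') {
                    pos++; // ordinary escape sequence: skip the escaped character
                    continue;
                }
                while (pos < source.length && source[pos] == 'u') pos++;
                int value = 0;
                int digits = 0;
                while (digits < 4 && pos < source.length) {
                    int d = Character.digit(source[pos], 16);
                    if (d < 0) break;
                    value = value * 16 + d;
                    pos++;
                    digits++;
                }
                if (digits < 4) continue; // malformed escape: out of scope for this sketch
                c = (char) value;
            }
            if (c == '\n' || c == '\r') {
                return new int[] {start, pos}; // range of the offending character only
            }
            if (c == '"') {
                return null; // literal closed normally
            }
        }
        return null;
    }

    public static void main(String[] args) {
        char[] source = "\"\\u004Ca\\u000Da\"".toCharArray();
        int[] range = locate(source, 0);
        System.out.println(range == null
                ? "no invalid character"
                : "invalid character at [" + range[0] + ", " + range[1] + ")");
    }
}
```

With this approach the two literals from comment 1 yield ranges that cover exactly the six characters of the carriage-return escape, regardless of what precedes it.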

Comment 4 Olivier Thomann

2005-10-11 15:03:48 EDT

Fixed and released in HEAD.
Regression tests added in
org.eclipse.jdt.core.tests.compiler.regression.ScannerTest.test042/43/44.
Fixed in both PublicScanner and internal scanner.
See bug 112246 for error reporting inside string literals.

Comment 5 Frederic Fusier

2006-04-14 13:08:07 EDT

Verified for 3.2 M3 using build I20051102-1600.