Bug 275467

Summary: Batch compiler writes log using default encoding instead of UTF-8
Product: [Eclipse Project] JDT Reporter: Nick Edgar <n.a.edgar>
Component: Core Assignee: Olivier Thomann <Olivier_Thomann>
Status: VERIFIED FIXED QA Contact:
Severity: minor    
Priority: P3 CC: david_audel, Olivier_Thomann
Version: 3.3 Flags: david_audel: review+
Target Milestone: 3.5 RC1   
Hardware: PC   
OS: Windows XP   
Whiteboard:
Attachments:
Description Flags
Proposed fix none

Description Nick Edgar CLA 2009-05-08 10:39:57 EDT
- invoke JDT compiler in Chinese locale with -log option
- log file header indicates UTF-8 encoding
- try to read file in UTF-8 (e.g. using Jazz's jdtCompileLogPublisher Ant task)
- it fails complaining:
  XML parsing error: "1 字节 UTF-8 序列的无效字节 1。" at line "2".
which roughly translates as "Invalid byte 1 of 1-byte UTF-8 sequence."

Line 2 is the date line, e.g. the first few lines of a log in English are:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 07/05/09 1:31:24 EDT PM -->
<!DOCTYPE compiler PUBLIC "-//Eclipse.org//DTD Eclipse JDT 3.2.003 Compiler//EN" "http://www.eclipse.org/jdt/core/compiler_32_003.dtd">
<compiler copyright="Copyright IBM Corp 2000, 2007. All rights reserved." name="Eclipse Java Compiler" version="0.780_R33x, 3.3.1">
        <command_line>

The problem appears to be in Main$Logger.setLog where it does:
  this.log = new PrintWriter(new FileOutputStream(logFileName, false));

This uses the default encoding.  It should instead wrap the stream in a writer with an explicit charset, e.g.:
  this.log = new PrintWriter(new OutputStreamWriter(new FileOutputStream(logFileName, false), Util.UTF_8));


This is blocking TVT testing of Jazz, but we are looking into a workaround, e.g. specifying -Dfile.encoding=UTF-8 on the command line.
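
For illustration, a minimal sketch of writing a log with an explicit charset so the bytes on disk match the XML header regardless of the platform default. The class name Utf8LogSketch, its methods, and the temp file are hypothetical stand-ins, not JDT code.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the JDT Main$Logger code.
public class Utf8LogSketch {

    // Writes an XML header and one comment line, always encoded as UTF-8.
    static void writeLog(Path logFile, String dateLine) throws IOException {
        try (PrintWriter log = new PrintWriter(
                new OutputStreamWriter(
                        new FileOutputStream(logFile.toFile(), false),
                        StandardCharsets.UTF_8))) {
            log.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            log.println("<!-- " + dateLine + " -->");
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String readLog(Path logFile) throws IOException {
        return new String(Files.readAllBytes(logFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("compilelog", ".xml");
        // A date line containing Chinese characters (assumed sample text).
        writeLog(logFile, "2009\u5e745\u67088\u65e5");
        System.out.println(readLog(logFile));
        Files.delete(logFile);
    }
}
```

The -Dfile.encoding=UTF-8 workaround gets the same bytes by changing the default instead, which also affects every other default-encoding read and write in that JVM.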
Comment 1 Nick Edgar CLA 2009-05-08 10:43:08 EDT
Note: I'd expect the date line to be in the locale-specific format, which would likely contain multi-byte characters in the Chinese ('zh') locale.

I also noticed that the line that writes the date:
  this.log.println("<!-- " + new String(dateFormat.format(date).getBytes(), Util.UTF_8) + " -->");//$NON-NLS-1$//$NON-NLS-2$
converts between encodings incorrectly: it converts the string to bytes using the default encoding, then back to a string using UTF-8.  There's no need for this conversion.  It should just do:
  this.log.println("<!-- " + dateFormat.format(date) + " -->");//$NON-NLS-1$//$NON-NLS-2$
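
The effect of that round trip can be reproduced in isolation. A minimal sketch, with these assumptions: ISO-8859-1 stands in for a non-UTF-8 platform default (it is guaranteed in every JRE, unlike the Chinese code pages), a French month abbreviation stands in for localized date text, and the class name DateRoundTripSketch is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the encode/decode mismatch; not JDT code.
public class DateRoundTripSketch {

    // Mimics new String(text.getBytes(), Util.UTF_8), with the
    // "default" charset passed in explicitly so the result is repeatable.
    static String roundTrip(String text, Charset platformDefault) {
        byte[] bytes = text.getBytes(platformDefault);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String date = "8 f\u00e9vr. 2009"; // contains one non-ASCII character

        // Default charset differs from UTF-8: the non-ASCII byte is not valid
        // UTF-8, so decoding substitutes U+FFFD and the text is corrupted.
        String broken = roundTrip(date, StandardCharsets.ISO_8859_1);
        System.out.println(date.equals(broken));
        System.out.println(broken.indexOf('\uFFFD') >= 0);

        // Default charset is UTF-8: the round trip is a no-op.
        System.out.println(date.equals(roundTrip(date, StandardCharsets.UTF_8)));
    }
}
```

ASCII-only dates survive the round trip under either charset, which is consistent with the English log above parsing without errors.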

Comment 2 Nick Edgar CLA 2009-05-08 10:58:46 EDT
Hm, the use of the default encoding for the PrintWriter might not be the problem.
The name of the log file we're using (in the Ant script) is declared as:
  <property name="compileLog" value="${java.io.tmpdir}/compilelog.xml"/>

The Logger code tries to handle XML files differently:
int index = logFileName.lastIndexOf('.');
if (index != -1) {
	if (logFileName.substring(index).toLowerCase().equals(".xml")) { //$NON-NLS-1$
		this.log = new GenericXMLWriter(new OutputStreamWriter(new FileOutputStream(logFileName, false), Util.UTF_8), Util.LINE_SEPARATOR, true);

which looks good to me.
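
As a quick sanity check on which branch a given log name takes, here is a small sketch of the same extension test. The class and method names are hypothetical, and Locale.ROOT is my addition to keep the lower-casing locale-independent; the quoted JDT code calls toLowerCase() without it.

```java
import java.util.Locale;

// Hypothetical helper mirroring the extension test quoted above; not JDT code.
public class LogKindSketch {

    // True when the text after the last '.' is "xml", ignoring case.
    static boolean isXmlLog(String logFileName) {
        int index = logFileName.lastIndexOf('.');
        return index != -1
                && logFileName.substring(index).toLowerCase(Locale.ROOT).equals(".xml");
    }

    public static void main(String[] args) {
        // Sample paths are assumptions for illustration.
        System.out.println(isXmlLog("C:\\Temp\\compilelog.xml"));
        System.out.println(isXmlLog("compilelog.txt"));
    }
}
```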

We're invoking the Ant javac task with:
    <javac destdir="${build.output}"
       failonerror="false"
       debug="on"
       debuglevel="2"
       includes="**/*.java, *.java"
       srcdir="${workingDir}">
    	<compilerarg line="-log ${compileLog}"/>
    </javac>

It may be that the expansion of ${java.io.tmpdir} is confusing things (though it works OK for me on WinXP in English Canada locale). 

I'll dig further.
Comment 3 Nick Edgar CLA 2009-05-08 12:13:29 EDT
Turns out we were running an older version of the compiler (from 3.3).
Looks like the main issue with the encoding was fixed in 3.4.
Earlier versions (3.2 and 3.3) use the default encoding:
  this.log = new GenericXMLWriter(new FileOutputStream(logFileName, false), Util.LINE_SEPARATOR, true);

You might still want to consider the minor issue in comment 1.
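
To tell whether a log produced by a given compiler version really is UTF-8, a strict decode can be used as a check. This is a minimal sketch with a hypothetical class name (Utf8CheckSketch); the sample bytes are made up for illustration rather than taken from a real log.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical checker; not part of JDT or the Jazz Ant tasks.
public class Utf8CheckSketch {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "<!-- 8 f\u00e9vr. 2009 -->";
        // Bytes as a non-UTF-8 default encoding would have written them.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
        // Bytes as an explicit UTF-8 writer would have written them.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For a real log, the bytes could come from Files.readAllBytes on the log path.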
Comment 4 Olivier Thomann CLA 2009-05-08 13:54:45 EDT
Reducing severity to minor, as the problem is only with the date encoding.
Comment 5 Olivier Thomann CLA 2009-05-08 13:57:36 EDT
Created attachment 134997
Proposed fix
Comment 6 Olivier Thomann CLA 2009-05-08 13:58:01 EDT
The patch fixes the problem mentioned in comment 1.
David, please review.
Comment 7 David Audel CLA 2009-05-11 07:25:48 EDT
Patch looks good.
Comment 8 Olivier Thomann CLA 2009-05-11 08:33:10 EDT
Released for 3.5RC1.
Code verification is required in order to verify this fix.
Comment 9 Kent Johnson CLA 2009-05-14 10:56:52 EDT
Verified using I20090513-2000