Bug 275467 - Batch compiler writes log using default encoding instead of UTF-8
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core
Version: 3.3
Hardware: PC Windows XP
Importance: P3 minor
Target Milestone: 3.5 RC1
Assignee: Olivier Thomann
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-05-08 10:39 EDT by Nick Edgar
Modified: 2009-05-14 10:56 EDT

See Also:
david_audel: review+


Attachments
Proposed fix (1.34 KB, patch)
2009-05-08 13:57 EDT, Olivier Thomann

Description Nick Edgar 2009-05-08 10:39:57 EDT
- invoke JDT compiler in Chinese locale with -log option
- log file header indicates UTF-8 encoding
- try to read file in UTF-8 (e.g. using Jazz's jdtCompileLogPublisher Ant task)
- it fails complaining:
  XML parsing error: "1 字节 UTF-8 序列的无效字节 1。" at line "2".
which roughly translates as "Invalid byte 1 of 1-byte UTF-8 sequence."

Line 2 is the date line, e.g. the first few lines of a log in English are:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 07/05/09 1:31:24 EDT PM -->
<!DOCTYPE compiler PUBLIC "-//Eclipse.org//DTD Eclipse JDT 3.2.003 Compiler//EN" "http://www.eclipse.org/jdt/core/compiler_32_003.dtd">
<compiler copyright="Copyright IBM Corp 2000, 2007. All rights reserved." name="Eclipse Java Compiler" version="0.780_R33x, 3.3.1">
        <command_line>

The problem appears to be in Main$Logger.setLog where it does:
  this.log = new PrintWriter(new FileOutputStream(logFileName, false));

This uses the default encoding.  It should instead wrap the stream in a writer with an explicit encoding:
  this.log = new PrintWriter(new OutputStreamWriter(new FileOutputStream(logFileName, false), Util.UTF_8));


This is blocking TVT testing of Jazz, but we are looking into a workaround, e.g. specifying -Dfile.encoding=UTF-8 on the command line.
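
To make the failure mode above concrete, here is a minimal, self-contained sketch; it is not the JDT code. The class and method names (LogEncodingDemo, writeLog, isValidUtf8) are hypothetical, and ISO-8859-1 with a French month name stands in for the Chinese default encoding, because ISO-8859-1 is guaranteed to exist on every JVM. The sketch writes a log whose header claims UTF-8, then checks whether the bytes on disk really decode as UTF-8.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of JDT.
class LogEncodingDemo {

    // Writes a two-line log. The header always claims UTF-8, but the bytes
    // are produced with whatever charset the caller passes in.
    public static void writeLog(Path file, Charset actual) throws IOException {
        try (PrintWriter log = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(file.toFile(), false), actual))) {
            log.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            // Assumed locale-specific date line: "8 août 2009" (u-circumflex is non-ASCII).
            log.println("<!-- 8 ao\u00fbt 2009 -->");
        }
    }

    // Strictly decodes the file as UTF-8; returns false on any malformed byte
    // instead of silently substituting a replacement character.
    public static boolean isValidUtf8(Path file) throws IOException {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(Files.readAllBytes(file)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path good = Files.createTempFile("compilelog-utf8", ".xml");
        Path bad = Files.createTempFile("compilelog-other", ".xml");
        try {
            writeLog(good, StandardCharsets.UTF_8);
            writeLog(bad, StandardCharsets.ISO_8859_1); // stand-in for a non-UTF-8 default
            System.out.println(isValidUtf8(good)); // true
            System.out.println(isValidUtf8(bad));  // false
        } finally {
            Files.deleteIfExists(good);
            Files.deleteIfExists(bad);
        }
    }
}
```

A strict XML parser behaves like isValidUtf8 here: it trusts the declared encoding and rejects the first byte that does not fit, which would explain an error reported on the date line.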
Comment 1 Nick Edgar 2009-05-08 10:43:08 EDT
Note: I'd expect the date line to be in the locale-specific format, which would likely use double-byte characters in the Chinese ('zh') locale.

I also noticed that the line that writes the date:
  this.log.println("<!-- " + new String(dateFormat.format(date).getBytes(), Util.UTF_8) + " -->");//$NON-NLS-1$//$NON-NLS-2$
converts between encodings incorrectly: it converts to bytes using the default encoding, then back to a string using UTF-8.  There's no need for this conversion.  It should just do:
  this.log.println("<!-- " + dateFormat.format(date) + " -->");//$NON-NLS-1$//$NON-NLS-2$
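
As a rough illustration of why that round trip is harmful, the sketch below isolates the pattern in a hypothetical helper (DateRoundTripDemo.roundTrip is not a JDT method). The default charset is passed in explicitly so the effect is reproducible on any machine; ISO-8859-1 and a French date are assumed stand-ins for the Chinese locale in the report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JDT.
class DateRoundTripDemo {

    // Mimics new String(text.getBytes(), "UTF-8"), with the platform default
    // charset made an explicit parameter.
    public static String roundTrip(String text, Charset defaultCharset) {
        return new String(text.getBytes(defaultCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String date = "8 ao\u00fbt 2009"; // assumed locale-formatted date with one non-ASCII char

        // Non-UTF-8 default: the single byte 0xFB is not valid UTF-8, so the
        // decoder substitutes U+FFFD and the original character is lost.
        String damaged = roundTrip(date, StandardCharsets.ISO_8859_1);
        System.out.println(date.equals(damaged)); // false

        // UTF-8 default: the round trip is a no-op, so the extra conversion
        // is never useful.
        System.out.println(date.equals(roundTrip(date, StandardCharsets.UTF_8))); // true

        // Pure ASCII survives either way, which is why an English-locale
        // date line does not show the problem.
        String ascii = "07/05/09 1:31:24 EDT PM";
        System.out.println(ascii.equals(roundTrip(ascii, StandardCharsets.ISO_8859_1))); // true
    }
}
```

Passing the already-correct String straight to the writer, and letting the writer's own encoding do the single conversion to bytes, avoids the problem entirely.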

Comment 2 Nick Edgar 2009-05-08 10:58:46 EDT
Hm, the use of the default encoding for the PrintWriter might not be the problem.
The name of the log file we're using (in the Ant script) is declared as:
  <property name="compileLog" value="${java.io.tmpdir}/compilelog.xml"/>

The Logger code tries to handle XML files differently:
int index = logFileName.lastIndexOf('.');
if (index != -1) {
	if (logFileName.substring(index).toLowerCase().equals(".xml")) { //$NON-NLS-1$
		this.log = new GenericXMLWriter(new OutputStreamWriter(new FileOutputStream(logFileName, false), Util.UTF_8), Util.LINE_SEPARATOR, true);

which looks good to me.

We're invoking the Ant javac task with:
    <javac destdir="${build.output}"
       failonerror="false"
       debug="on"
       debuglevel="2"
       includes="**/*.java, *.java"
       srcdir="${workingDir}">
    	<compilerarg line="-log ${compileLog}"/>
    </javac>

It may be that the expansion of ${java.io.tmpdir} is confusing things (though it works OK for me on WinXP in English Canada locale). 

I'll dig further.
Comment 3 Nick Edgar 2009-05-08 12:13:29 EDT
Turns out we were running an older version of the compiler (from 3.3).
Looks like the main issue with the encoding was fixed in 3.4.
Earlier versions (3.2 and 3.3) use the default encoding:
  this.log = new GenericXMLWriter(new FileOutputStream(logFileName, false), Util.LINE_SEPARATOR, true);

You might still want to consider the minor issue in comment 1.
Comment 4 Olivier Thomann 2009-05-08 13:54:45 EDT
Reduce to minor as the problem is only with the date encoding.
Comment 5 Olivier Thomann 2009-05-08 13:57:36 EDT
Created attachment 134997
Proposed fix
Comment 6 Olivier Thomann 2009-05-08 13:58:01 EDT
Patch fixes the problem mentioned in comment 1.
David, please review.
Comment 7 David Audel 2009-05-11 07:25:48 EDT
Patch looks good.
Comment 8 Olivier Thomann 2009-05-11 08:33:10 EDT
Released for 3.5RC1.
Code verification is required in order to verify this fix.
Comment 9 Kent Johnson 2009-05-14 10:56:52 EDT
Verified using I20090513-2000