110576 – [encoding] Rename CU looses encoding for file which charset is determined by contents

Bug 110576 - [encoding] Rename CU looses encoding for file which charset is determined by contents

Summary: [encoding] Rename CU looses encoding for file which charset is determined by ...

Status:	VERIFIED FIXED

Alias:	None

Product:	JDT
Classification:	Eclipse Project
Component:	Core (show other bugs)
Version:	3.1.1
Hardware:	PC Windows XP

Importance:	P3 normal (vote)
Target Milestone:	3.2 M3
Assignee:	Frederic Fusier
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2005-09-26 06:13 EDT by Martin Aeschlimann
Modified:	2005-10-31 06:15 EST (History)
CC List:	4 users (show)

See Also:

Attachments
Patch to fix this issue (5.49 KB, patch) 2005-09-29 11:23 EDT, Frederic Fusier	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Aeschlimann

2005-09-26 06:13:54 EDT

3.1.1

- Open notepad, enter some umlauts to the text and save with 'Save As', UTF-8,
file name 'X.txt'
- Drag the file into a workspace; check file properties to see that encoding has
been detected: UTF 8, with BOM
- Use 'Rename' (-> resource rename) to rename the file to 'X.java'
- File properties still show the file as UTF-8
- Use 'Rename' (-> refactoring rename) to rename file to 'Y.java'
Now the file-encoding is CP1252

Comment 1 Martin Aeschlimann

2005-09-26 11:32:43 EDT

markus, is this us or jdt.core?

Comment 2 Markus Keller

2005-09-28 06:05:40 EDT

This is core. We just call ICompilationUnit#rename(..).

Note that the problem only occurs when the CU is in a non-default package. I
guess the rewrite of the package declaration does not set the encoding, and thus
the BOM is lost.

Comment 3 Frederic Fusier

2005-09-28 06:49:15 EDT

Cannot reproduce using build M20050923-1430.
Note that this issue was a duplicate of bug 66898. So it's sounds normal that it
does not longer occur in this build (there's a regression test in EncodingTests).

*** This bug has been marked as a duplicate of 66898 ***

Comment 4 Martin Aeschlimann

2005-09-28 07:01:42 EDT

I still have this in M20050923-1430.

Make sure that the file is in a package (not the default package)

Comment 5 Frederic Fusier

2005-09-28 07:12:23 EDT

Agreed, that was the missing info to reproduce it...

Comment 6 Frederic Fusier

2005-09-28 07:13:20 EDT

Sorry, in fact I missed it in comment 2...

Comment 7 Frederic Fusier

2005-09-28 10:21:00 EDT

Only happen for file which encoding is determined by content.
Also occur in 3.0

Comment 8 Frederic Fusier

2005-09-28 13:27:31 EDT

This problems comes from the fact that copy/move a resource in Java perspective
also change its contents.

Problem is not on resource charset but on its content description which is lost
during the operation. As Platform clients has no possibility to set this
description, I cannot figure out how we can fix that. I currently just have a
workaround which explicitely sets file charset to content description charset
value in this case which is not really a good solution...

I see that content description cache in ContentDescriptionManager uses timestamp
based on ResourceInfo.charSetAndContentId. This works well while only
moving/copying a resource or while only changing its contents, but it seems that
it does not work while doing both.

Perhaps is it related to bug 67606? I don't really know...

Move to Platform/Resources for comments and further investigations.

Comment 9 Rafael Chaves

2005-09-28 14:37:25 EDT

"This problems comes from the fact that copy/move a resource in Java perspective
also change its contents."

Frederic, before we look into the potential issue you mentioned, could you
clarify on this? How is that content change performed? Who does that? If
java.io.Readers/Writers are used, that would explain the loss of a BOM.

Comment 10 Frederic Fusier

2005-09-28 16:31:31 EDT

Changes is done using an org.eclipse.jdt.core.dom.rewrite.ASTRewrite and saved
using org.eclipse.jdt.internal.core.Buffer.

Put a breakpoint in method processCompilationUnitResource(ICompilationUnit,
PackageFragment) of org.eclipse.jdt.internal.core.CopyResourceElementsOperation
class.

You'll see first copy/move of resource associated with ICompilationUnit, then
after, in method saveContent(PackageFragment,String,ASTRewrite,String,IFile),
change of code and save using method save(IProgressMonitor,boolean) of
ICompilationUnit which calls corresponding save(IProgressMonitor,boolean) method
of Buffer.

In this method, this is the IFile which is used to set new contents, no direct
java.io access is done.

Comment 11 Rafael Chaves

2005-09-28 17:23:33 EDT

Thanks for the pointers. I know what is going on. 

The encoding information is not lost during the file rename
(CopyResourceElementsOperation#processCompilationUnitResource). It is lost later
during Buffer#save(...). The following code will cause the BOM to be lost:

byte[] bytes = encoding == null ? 
  stringContents.getBytes() : 
  stringContents.getBytes(encoding);

"stringContents" will be encoded as UTF-8 but the BOM is lost now
(String.getBytes("UTF-8") does not produce a BOM). And the problem is that there
is no supported way in Java for creating an output stream that has a UTF-8 BOM
(UTF-8 BOMs are optional). The text/Java editors, for instance, explicitly write
a BOM out before writing the file contents.

If you want to do something along those lines (only when the encoding is UTF-8),
you can use this to find if the file has a BOM:

IContentDescription.getProperty(IContentDescription.BYTE_ORDER_MARK)

And then you can write those bytes to the byte array output stream before
writing the encoded string contents.

Reassigning to Frederic as requested.

Comment 12 Frederic Fusier

2005-09-28 17:41:25 EDT

Cristal clear now :-)
Thanks Rafael for the help

Comment 13 Frederic Fusier

2005-09-29 11:23:10 EDT

Created attachment 27679 [details]
Patch to fix this issue

Apply same patch in Buffer.save than Platform/Text does in
ResourceTextFileBuffer.commitFileBufferContent...

Comment 14 Frederic Fusier

2005-09-30 09:13:11 EDT

Patch released in HEAD.

Test case added in EncodingTests

Comment 15 Jerome Lanneluc

2005-10-31 06:15:27 EST

Verified for 3.2 M3 using build I20051031-0010