Bug 60636 - [encoding] determining the encoding to use when saving
Summary: [encoding] determining the encoding to use when saving
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources
Version: 3.0
Hardware: PC Windows XP
Importance: P3 enhancement
Target Milestone: 3.1 M6
Assignee: Rafael Chaves CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 63047
Depends on:
Blocks:
 
Reported: 2004-04-30 16:15 EDT by Rafael Chaves CLA
Modified: 2005-02-24 14:54 EST (History)
3 users

See Also:


Attachments

Description Rafael Chaves CLA 2004-04-30 16:15:16 EDT
Text editor should determine which encoding to use during save. This could be:

a) the encoding forced by the user: IFile#getCharset(false)
b) the encoding corresponding to the BOM obtained when reading (see bug 58859)
c) the encoding inferred from the current in-memory contents:
    IContentTypeManager#getContentDescriptionFor(Reader, fileName, new
QualifiedName[] {IContentDescription.CHARSET})
d) the parent default charset: IFile#getParent()#getDefaultCharset()

If there is no interest in preserving BOMs, b) is not necessary.
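The a)-d) lookup order can be sketched as a simple cascade. This is an illustration only, not Eclipse code: `determineSaveCharset` and its parameters are hypothetical stand-ins for the values the real IFile/IContentTypeManager APIs would supply, and the XML-declaration regex is a toy substitute for content description.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not Eclipse code) of the a)-d) lookup order.
public class SaveCharsetSketch {
    private static final Pattern XML_ENCODING =
            Pattern.compile("encoding=[\"']([^\"']+)[\"']");

    static String determineSaveCharset(String userCharset,      // a) IFile#getCharset(false)
                                       String bomCharset,       // b) BOM seen when reading
                                       String inMemoryContents, // c) sniff the editor buffer
                                       String parentDefault) {  // d) parent default charset
        if (userCharset != null)
            return userCharset;                 // a) user-forced encoding wins
        if (bomCharset != null)
            return bomCharset;                  // b) preserve the BOM's encoding
        // c) toy stand-in for content description: look for an XML encoding attribute
        Matcher m = XML_ENCODING.matcher(inMemoryContents);
        if (m.find() && Charset.isSupported(m.group(1)))
            return m.group(1);
        return parentDefault;                   // d) fall back to the parent's default
    }
}
```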
Comment 1 Dani Megert CLA 2004-05-21 11:21:23 EDT
Rafael, wouldn't IFile#getCharset(true) do what you describe in comment 0?
Comment 2 Dani Megert CLA 2004-05-21 11:54:48 EDT
*** Bug 63047 has been marked as a duplicate of this bug. ***
Comment 3 Rafael Chaves CLA 2004-05-21 16:58:52 EDT
(sorry for the delay, we had an e-mail outage here)

More or less, except that you want to check the contents that are in memory
(editor buffer), not the ones in disk (which is what IFile.getCharset knows
about). Imagine the scenario where the user decides to change the encoding
attribute in the XML declaration (for instance, the user changes the encoding
attribute to be ISO-8859-1 when it was originally UTF-8). Step c) above is
related to that. If you don't support that, next time the file is opened there
might be problems (the contents were encoded using UTF-8, but the file says it
is ISO-8859-1, and ISO-8859-1 will be the encoding returned by IFile#getCharset()).

Also, if you keep the BOM in the in-memory contents, b) and c) can be merged,
and then you have to check both BYTE_ORDER_MARK and CHARSET properties in the
resulting content description.
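The BYTE_ORDER_MARK side of that merged check can be illustrated in plain Java. This is not the Eclipse implementation (IContentDescription does this for registered content types); it only shows how the standard BOM byte sequences map to charset names:

```java
import java.util.Arrays;

// Plain-Java illustration of BOM-based charset detection.
public class BomSniffer {
    static final byte[] UTF8_BOM    = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    static final byte[] UTF16BE_BOM = {(byte) 0xFE, (byte) 0xFF};
    static final byte[] UTF16LE_BOM = {(byte) 0xFF, (byte) 0xFE};

    /** Returns the charset implied by a leading BOM, or null if there is none. */
    static String charsetForBom(byte[] head) {
        if (startsWith(head, UTF8_BOM))    return "UTF-8";
        if (startsWith(head, UTF16BE_BOM)) return "UTF-16BE";
        if (startsWith(head, UTF16LE_BOM)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```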
Comment 4 Dani Megert CLA 2004-05-27 13:12:54 EDT
A provisional fix is in I200405271600. For this we had to study the internal
implementation of Platform Core's IFile.getCharset() regarding the strategies
used to get the encoding and what is done in case of exceptions. Changes in
Platform Core can break us, since detecting the same encoding for reading and
writing is essential. In order to fix this correctly we need the following API:

Ideally we would have:
???.getCharset(Reader, String name, IContentDescription description);
???.getCharset(InputStream, String name, IContentDescription description);

The description might be null. It is used to mirror the case of an IFile having
an explicit encoding. Currently this wins, but the strategy might change in the
future and the content might override it. Therefore we need to pass this info
along. The use case for passing it along is when we save a file as a new one.
Comment 5 Rafael Chaves CLA 2004-05-27 14:35:09 EDT
I didn't follow your last statement on the use case for passing in a content
description. Is it just so we may now the byte order mark used when the file was
read (step "b" in the original description)?
Comment 6 Rafael Chaves CLA 2004-05-27 14:46:07 EDT
I meant: "...we may *know*..."
Comment 7 Rafael Chaves CLA 2004-05-27 16:57:43 EDT
Daniel, I didn't notice you were not in the CC list... please see comment 5 above.
Comment 8 Dani Megert CLA 2004-05-28 04:53:45 EDT
forgot to add myself - thanks

If I ask IFile.getCharset() there should be no need for me to know whether you
first check the explicit file's charset or use some other strategy. Now assume I
want to save that file under a different name. Since I cannot simply copy the
content description from original file to the new one (setCharset not allowed
for nonexistent files) I need to pass this to the method which computes the
content description: maybe you return it directly, maybe you use another strategy.
Comment 9 Rafael Chaves CLA 2004-06-02 16:38:30 EDT
Daniel, if I understood it well, you have a workaround right now that fixes the
problem, right? In that case, it is just too late to fix this now, so we will
have to postpone it to post-3.0.

Let me know if that is not the case.
Comment 10 Dani Megert CLA 2004-06-03 04:37:24 EDT
yes, we do have a workaround.
Comment 11 Rafael Chaves CLA 2005-01-28 16:25:47 EST
See also bug 76298.
Comment 12 Rafael Chaves CLA 2005-02-09 11:27:40 EST
Daniel, I am sketching an API for this, and have some questions. The API looks
like this:

String org.eclipse.core.resources.IFile.getCharsetFor(Reader contents) throws
CoreException

Basic behavior:
1) if the file exists, and has an explicit charset setting, return it
2) submit the contents to content description
3) if a content description was produced (the file has a known content type)
and encoding info could be found (e.g. XML's encoding attribute), return it
4) return the implicit charset setting (querying the parent resource)

One thing this does not handle is BOMs (because the input is a Reader, not an
input stream). This API is intended to be used by clients that need to figure
out what encoding to use when converting from text to bytes, so it would not
make sense to have an input stream based version (I could take a client provided
BOM though). But I noticed that you *do* use an input stream in your ad-hoc
implementation (ResourceTextFileBuffer/JavaTextFileBuffer) and rely on the
content type support for BOM inspection. It seems you provide a custom input
stream implementation (DocumentInputStream) that will return every char as a
byte, assuming that ASCII is used in the part of the contents actually read.
Isn't that an unsafe assumption? Does the stream you handle to the content type
manager prepended by a BOM regardless the type of BOM?
Comment 13 Rafael Chaves CLA 2005-02-09 15:01:41 EST
So this does not get lost: this is what the implementation could look like.

/* (non-Javadoc)
 * @see IFile#getCharsetFor(Reader)
 */
public String getCharsetFor(Reader contents) throws CoreException {
    String charset;
    // non-existing resources default to the parent's charset
    ResourceInfo info = getResourceInfo(false, false);
    int flags = getFlags(info);
    if (exists(flags, true))
        // the file exists, look for a user setting
        if ((charset = workspace.getCharsetManager().getCharsetFor(getFullPath(), false)) != null)
            // if there is a file-specific user setting, use it
            return charset;
    // try to obtain a description from the contents provided
    IContentDescription description;
    try {
        IContentTypeManager contentTypeManager = Platform.getContentTypeManager();
        description = contentTypeManager.getDescriptionFor(contents, getName(),
                new QualifiedName[] {IContentDescription.CHARSET});
    } catch (IOException e) {
        String message = NLS.bind(Messages.resources_errorContentDescription, getFullPath());
        throw new ResourceException(IResourceStatus.FAILED_DESCRIBING_CONTENTS,
                getFullPath(), message, e);
    }
    if (description != null)
        if ((charset = description.getCharset()) != null)
            // the description contained charset info, we are done
            return charset;
    // could not find out the encoding based on the contents... default to the parent's
    return workspace.getCharsetManager().getCharsetFor(
            getFullPath().removeLastSegments(1), true);
}
Comment 14 Dani Megert CLA 2005-02-10 10:32:19 EST
>(I could take a client provided BOM though)
That's one of the reasons for the content description suggested in comment 4.

IFile.getCharsetFor(Reader contents) is good enough (and no content description
parameter is needed, since you can get it via the IFile) for existing files that
are saved onto themselves, but we need a more powerful API for the SaveAs case,
where there's either an untitled document or an IFile which might be stored
under a different name. Therefore we need to be able to pass in the content
description.

>Isn't that an unsafe assumption?
Does the API method getDescriptionFor(InputStream,...) expect another kind of
input stream, or do you simply want to say it would be better to use the
Reader-based API?

>Does the stream you handle to the content type
>manager prepended by a BOM regardless the type of BOM?
Sorry but I do not understand that question.
Comment 15 Dani Megert CLA 2005-02-10 11:08:40 EST
I filed bug 84903 to revert the input stream to a reader.
Comment 16 Rafael Chaves CLA 2005-02-10 11:10:13 EST
> That's one of the reasons for the content 
> description suggested in comment 4.

Is there another one? I mean, will we need anything else from the content
description other than the BOM property? Otherwise I would rather only take the BOM.

>Does the API method getDescriptionFor(InputStream,...) expect
>another kind of input stream or do simply want to say it 
>would be better to use the Reader based API?

The IContentTypeManager APIs take any input stream. I was just pointing out the
fact that DocumentInputStream breaks the InputStream contract by returning
integers outside the range [-1, 255]. Theoretically, this may lead to incorrect
results (since 16-bit chars will be narrowed to single byte values).

This leads to my last question. I could not understand why a reader was not used
instead. The only reason I could think of is that you wanted BOM detection, and
BOM detection is not applicable to Readers, so you were forced to use an input
stream. So, can the stream you hand to the content type manager API contain a BOM?
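The contract problem described above can be shown in a few lines of plain Java. `InputStream.read()` must return a value in [-1, 255], but a Java char ranges up to 0xFFFF, so a stream that returns chars directly (or narrows them to bytes, as described for DocumentInputStream) corrupts anything outside the single-byte range:

```java
// Shows why returning raw char values from an InputStream breaks its contract:
// read() must return -1..255, but a char can be up to 0xFFFF.
public class NarrowingDemo {
    public static void main(String[] args) {
        char euro = '\u20AC';               // the euro sign
        int asReturned = euro;              // what a char-backed read() would return
        int narrowed = (byte) euro & 0xFF;  // what narrowing to a byte keeps

        System.out.println(asReturned > 255); // true: outside the read() contract
        System.out.println(narrowed);         // 172 (0xAC): the high byte is gone
    }
}
```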
Comment 17 Rafael Chaves CLA 2005-02-21 15:54:02 EST
So I am planning to add this API to IFile:

public String getCharsetFor(Reader contents) throws CoreException

When creating a new file, if the client has a BOM it wants to use, the encoding
to be used is the one corresponding to that BOM (no need to call the new API).

When updating an existing file, if the client has a BOM it wants to use, first
it has to check whether the file has a hard encoding (set by the user) with
getCharset(false). If that returns a non-null string, that is the encoding to be
used. If it returns null, the encoding to be used is the one corresponding to
the BOM (no need to call the new API).

If the client does not have a BOM at hand, then it should just call the new
getCharsetFor(contents).

Daniel, please let me know whether the new API meets your needs, and I will
release it.
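The client-side rules above can be written as a pure function. This is an illustrative sketch only: the parameters stand in for the Eclipse calls (`userCharset` for IFile#getCharset(false), `bomCharset` for the encoding implied by a BOM the client wants to preserve, `inferredCharset` for the result of the new getCharsetFor(Reader)).

```java
// Comment 17's client-side decision rules as a pure function (illustration only).
public class SaveRules {
    static String charsetToUse(boolean fileExists, String bomCharset,
                               String userCharset, String inferredCharset) {
        if (bomCharset != null) {
            // Existing file: a user-set (hard) encoding wins over the BOM.
            if (fileExists && userCharset != null)
                return userCharset;
            // New file, or no hard encoding: the BOM decides.
            return bomCharset;
        }
        // No BOM at hand: ask the new API.
        return inferredCharset;
    }
}
```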
Comment 18 Dani Megert CLA 2005-02-22 09:39:47 EST
Sorry, I was busy shipping M5. Let me first comment on comment 16:

>Is there another one? I mean, will we need anything else from the content
>description other than the BOM property? Otherwise I would rather only take the
>BOM.
The idea of the new API is to hide (or better, provide) whatever happens in
IFile.getCharset(). I only see a problem with relying on the BOM if
IFile.getCharset() uses another (new) property from the content description in
the future.

>This leads to my last question. I could not understand why a reader was not 
>used instead.
>So, can the stream you handle to the content type manager API contain a BOM?
Yes, since it can also be a stream we get from File (for external files which
are not IFile). Is it wrong to use the stream there? If so, what else should we use?

Comment 19 Dani Megert CLA 2005-02-22 10:17:10 EST
Does the API work if the file does not yet exist? 

Let me give some concrete scenarios:

- save a new document into a folder F1 (which might have some encoding).
I'd like to have an API where I can provide the path and the document/reader and
get the charset back. If the API works for non-existent files then that will
work, otherwise I'd need more API. Without the API I'd need to know that I have
to ask the folder for the encoding.

- I have an existing file F1/f1.txt and save it as F1/f2.properties.
Property files have a pre-defined encoding. Again, if I can use the new API for
non-existent files it will work, otherwise I'd need API that can compute the
new charset for me.

- save an existing file F1/f1.xml with user-defined encoding into F2
This is a use case where I would need to pass in the new content and the
existing encoding of f1.xml (that's the other reason for having the content
description - I forgot about that in comment 18 - sorry). I could solve this
manually by knowing that the user-defined charset wins, but the main idea behind
the new API is that I don't need to know this. Actually, the easiest for the
SaveAs case would be if I could do the following:
1. newFile= IFolder.getFile(String)
2. newFile.setCharset(oldFile.getCharset(false))
3. newFile.getCharsetFor(Reader)

But I know that this is most likely not possible, since you do not want to
store information for non-existent files. Hence I was thinking of the API
sketched in comment 4. The improved version would look like:

???.getCharset(Reader, IFile, IContentDescription) where
- IFile does not need to exist
- IContentDescription contains the charset and BOM of the original file (if any)

I think ???.getCharset(InputStream, String name, IContentDescription) is not needed.
Comment 20 Rafael Chaves CLA 2005-02-22 13:04:16 EST
Re: comment 18

>>This leads to my last question. I could not understand 
>>why a reader was not used instead.
>>So, can the stream you handle to the content type manager 
>> API contain a BOM?
> Yes, since it can also be a stream we get from File (for 
>external files which are not IFile). Is it wrong to use the 
>stream there? If so, what else should we use?

Once you have contents in text form, it does not make sense to keep the BOM in
the contents. You might want to remember it somewhere else to insert it back
when encoding the contents again during save (if you want to preserve it).
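The strip/remember/reinsert round trip described above can be sketched in plain Java for the UTF-8 case. `BomRoundTrip` and its methods are hypothetical names for illustration; the Eclipse file buffers do the equivalent internally:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Round trip: strip the BOM when decoding to text, remember that it was there,
// and prepend it again when encoding on save (UTF-8 case only, for brevity).
public class BomRoundTrip {
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && Arrays.equals(Arrays.copyOf(bytes, 3), UTF8_BOM);
    }

    /** Decode, dropping the BOM so it never appears in the text buffer. */
    static String decode(byte[] bytes) {
        int offset = hasUtf8Bom(bytes) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    /** Encode, reinserting the BOM if the original file had one. */
    static byte[] encode(String text, boolean hadBom) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (hadBom)
            out.write(UTF8_BOM);
        out.write(text.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }
}
```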
Comment 21 Rafael Chaves CLA 2005-02-22 13:12:47 EST
Re: comment 19

> Does the API work if the file does not yet exist? 

Yes.

> - save an existing file F1/f1.xml with user-defined encoding into F2

This is similar to what I describe in comment 17 for saving an existing file.
The general rule is: if you have an encoding you want to use (you want to
preserve the original BOM or a user-set encoding), just use it; don't call
getCharsetFor(Reader). This is simple enough for clients to figure out by
themselves. If you agree with that, I hope you also agree that
IFile.getCharsetFor(Reader) should be enough for all of the use cases discussed here.
Comment 22 Dani Megert CLA 2005-02-24 03:22:49 EST
>Once you have contents in text form, it does not make sense to keep the BOM in
>the contents. You might want to remember it somewhere else to insert them back
>when encoding the contents again during save (if you want to preserve it).
Sure - we already do that.

The fact that I can use the API on a non-existing IFile is perfect (I was always
assuming it wouldn't be allowed) and fits my needs.

The request to allow passing/setting the charset is not absolutely needed; it
would have been easier for us to transfer it from an existing file to a
non-existing one, but it's really not a must-have.
Comment 23 Rafael Chaves CLA 2005-02-24 12:31:11 EST
Fixed. Released to HEAD. Test case added to CharsetTest.

Daniel, please let me know if you have any concerns/find any problems. If the
API ends up not being useful, there will be no point in adding it.
Comment 24 Dani Megert CLA 2005-02-24 14:54:34 EST
Rafael, I filed bug 86545 to actually use the new API and provide feedback.
Thanks for providing the API.