The text editor should determine which encoding to use during save. This could be:
a) the encoding forced by the user: IFile#getCharset(false)
b) the encoding corresponding to the BOM obtained when reading (see bug 58859)
c) the encoding inferred from the current in-memory contents: IContentTypeManager#getContentDescriptionFor(Reader, fileName, new QualifiedName[] {IContentDescription.CHARSET})
d) the parent default charset: IFile#getParent()#getDefaultCharset()
If there is no interest in preserving BOMs, b) is not necessary.
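For illustration, strategy b) can be sketched with plain JDK code. The byte patterns below are the standard UTF-8 and UTF-16 byte order marks; the class and method names are invented for this sketch and are not part of any Platform API:

```java
// Illustrative sketch of strategy b): mapping a leading byte order mark
// to a charset name. All names here are made up for the example.
public class BomSniffer {

    /** Returns the charset implied by a BOM at the start of the buffer, or null. */
    public static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF)
                return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE)
                return "UTF-16LE";
        }
        return null; // no recognizable BOM, fall through to strategies c) and d)
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(charsetFromBom(head)); // UTF-8
    }
}
```

In the real Platform, this detection is done by the content type support rather than by client code.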
Rafael, wouldn't IFile#getCharset(true) do what you describe in comment 0?
*** Bug 63047 has been marked as a duplicate of this bug. ***
(sorry for the delay, we had an e-mail outage here) More or less, except that you want to check the contents that are in memory (the editor buffer), not the ones on disk (which is what IFile.getCharset knows about). Imagine the scenario where the user decides to change the encoding attribute in the XML declaration (for instance, the user changes the encoding attribute to be ISO-8859-1 when it was originally UTF-8). Step c) above is related to that. If you don't support that, next time the file is opened there might be problems (the contents were encoded using UTF-8, but the file says it is ISO-8859-1, and ISO-8859-1 will be the encoding returned by IFile#getCharset()). Also, if you keep the BOM in the in-memory contents, b) and c) can be merged, and then you have to check both the BYTE_ORDER_MARK and CHARSET properties in the resulting content description.
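As a rough JDK-only illustration of step c): the real code would submit the buffer to IContentTypeManager#getDescriptionFor(Reader, ...), but the essence is that the user-edited encoding attribute in the in-memory prolog wins over whatever is recorded for the on-disk file. The class and method names below are invented for the sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for content description: reads the encoding
// attribute of an XML declaration from in-memory text.
public class XmlEncodingSniffer {

    private static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the encoding declared in the XML prolog, or null if absent. */
    public static String declaredEncoding(String contents) {
        Matcher m = ENCODING.matcher(contents);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The user edited the declaration in the buffer; the file on disk
        // may still be recorded as UTF-8.
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>";
        System.out.println(declaredEncoding(doc)); // ISO-8859-1
    }
}
```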
A provisional fix is in I200405271600. For this we had to study the internal implementation of Platform Core's IFile.getCharset() regarding the strategies used to get the encoding and what is done in case of exceptions. Changes in Platform Core can break us, since detecting the same encoding for reading and writing is essential. In order to fix this correctly we need the following API. Ideally we would have:
???.getCharset(Reader, String name, IContentDescription description);
???.getCharset(InputStream, String name, IContentDescription description);
The description might be null. It is used to mirror the case of an IFile having an explicit encoding. Currently this wins, but the strategy might change in the future and the content might override it. Therefore we need to pass this info along. The use case for passing it along is when we save a file under a new name.
I didn't follow your last statement on the use case for passing in a content description. Is it just so we may now the byte order mark used when the file was read (step "b" in the original description)?
I meant: "...we may *know*..."
Daniel, I didn't notice you were not in the CC list... please see comment 5 above.
forgot to add myself - thanks If I ask IFile.getCharset() there should be no need for me to know whether you first check the explicit file's charset or use some other strategy. Now assume I want to save that file under a different name. Since I cannot simply copy the content description from original file to the new one (setCharset not allowed for nonexistent files) I need to pass this to the method which computes the content description: maybe you return it directly, maybe you use another strategy.
Daniel, if I understood it well, you have a workaround right now that fixes the problem, right? In that case, it is just too late to fix this, so we will have to postpone it for post-3.0. Let me know if that is not the case.
yes, we do have a workaround.
See also bug 76298.
Daniel, I am sketching an API for this, and have some questions. The API looks like this:
String org.eclipse.core.resources.IFile.getCharsetFor(Reader contents) throws CoreException
Basic behavior:
1) if the file exists and has an explicit charset setting, return it
2) submit the contents to content description
3) if a content description was produced (the file has a known content type) and encoding info could be found (e.g. XML's encoding attribute), return it
4) return the implicit charset setting (querying the parent resource)
One thing this does not handle is BOMs (because the input is a Reader, not an input stream). This API is intended to be used by clients that need to figure out what encoding to use when converting from text to bytes, so it would not make sense to have an input-stream-based version (I could take a client-provided BOM, though). But I noticed that you *do* use an input stream in your ad-hoc implementation (ResourceTextFileBuffer/JavaTextFileBuffer) and rely on the content type support for BOM inspection. It seems you provide a custom input stream implementation (DocumentInputStream) that will return every char as a byte, assuming that ASCII is used in the part of the contents actually read. Isn't that an unsafe assumption? Is the stream you hand to the content type manager prefixed by a BOM, regardless of the type of BOM?
So this does not get lost, this is what the implementation could look like:

/* (non-Javadoc)
 * @see IFile#getCharset(Reader)
 */
public String getCharsetFor(Reader contents) throws CoreException {
	String charset;
	// non-existing resources default to parent's charset
	ResourceInfo info = getResourceInfo(false, false);
	int flags = getFlags(info);
	if (exists(flags, true))
		// the file exists, look for a user setting
		if ((charset = workspace.getCharsetManager().getCharsetFor(getFullPath(), false)) != null)
			// if there is a file-specific user setting, use it
			return charset;
	// try to obtain a description from the contents provided
	IContentDescription description;
	try {
		IContentTypeManager contentTypeManager = Platform.getContentTypeManager();
		description = contentTypeManager.getDescriptionFor(contents, getName(), new QualifiedName[] {IContentDescription.CHARSET});
	} catch (IOException e) {
		String message = NLS.bind(Messages.resources_errorContentDescription, getFullPath());
		throw new ResourceException(IResourceStatus.FAILED_DESCRIBING_CONTENTS, getFullPath(), message, e);
	}
	if (description != null)
		if ((charset = description.getCharset()) != null)
			// the description contained charset info, we are done
			return charset;
	// could not find out the encoding based on the contents... default to parent's
	return workspace.getCharsetManager().getCharsetFor(getFullPath().removeLastSegments(1), true);
}
> (I could take a client provided BOM though)
That's one of the reasons for the content description suggested in comment 4. IFile.getCharsetFor(Reader contents) is good enough (and no content description parameter is needed, since you can get it via the IFile) for existing files that are saved onto themselves, but we need a more powerful API for the Save As case, where there is either an untitled document or an IFile which might be stored under a different name. Therefore we need to be able to pass in the content description.
> Isn't that an unsafe assumption?
Does the API method getDescriptionFor(InputStream, ...) expect another kind of input stream, or do you simply want to say it would be better to use the Reader-based API?
> Does the stream you handle to the content type
> manager prepended by a BOM regardless the type of BOM?
Sorry, but I do not understand that question.
I filed a bug 84903 to revert the input stream to a reader.
> That's one of the reasons for the content
> description suggested in comment 4.
Is there another one? I mean, will we need anything else from the content description other than the BOM property? Otherwise I would rather only take the BOM.
> Does the API method getDescriptionFor(InputStream,...) expect
> another kind of input stream or do simply want to say it
> would be better to use the Reader based API?
The IContentTypeManager APIs take any input stream. I was just pointing out the fact that DocumentInputStream breaks the InputStream contract by returning integers outside the range [-1, 255]. Theoretically, this may lead to incorrect results (since 16-bit chars will be narrowed to single-byte values). This leads to my last question: I could not understand why a reader was not used instead. The only reason I could think of is that you wanted BOM detection, and BOM detection is not applicable to Readers, so you were forced to use an input stream. So, can the stream you hand to the content type manager API contain a BOM?
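The contract problem can be seen with a tiny stand-alone snippet (the method names below are invented for the demonstration; this is not the actual DocumentInputStream code). InputStream.read() must return a value in [-1, 255], so a char-per-byte stream either returns out-of-range values or, if it masks, silently corrupts any character outside Latin-1:

```java
// Why treating each char of a document as a "byte" is unsafe.
public class NarrowingDemo {

    /** The raw char value a char-per-byte stream would return from read(). */
    public static int rawValue(char c) {
        return c;
    }

    /** The value after narrowing to one byte: the high byte is lost. */
    public static int narrowed(char c) {
        return c & 0xFF;
    }

    public static void main(String[] args) {
        char euro = '\u20AC'; // the euro sign does not fit in a single byte
        System.out.println(rawValue(euro)); // 8364, outside the legal [-1, 255]
        System.out.println(narrowed(euro)); // 172, no longer the euro sign
    }
}
```

ASCII-only content survives either way, which is presumably the assumption the ad-hoc implementation relies on.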
So I am planning to add this API to IFile:
public String getCharsetFor(Reader contents) throws CoreException
When creating a new file, if the client has a BOM it wants to use, the encoding to be used is the one corresponding to the BOM (no need to call the new API). When updating an existing file, if the client has a BOM it wants to use, it first has to check whether the file has a hard encoding (set by the user) with getCharset(false). If that returns a non-null string, that is the encoding to be used. If it returns null, the encoding to be used is the one corresponding to the BOM (no need to call the new API). If the client does not have a BOM at hand, then it should just call the new getCharsetFor(contents). Daniel, please let me know whether the new API meets your needs, and I will release it.
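The decision procedure for an existing file described above might be sketched as follows. The class, method, and parameter names are invented stand-ins for the real calls (IFile#getCharset(false), the client's remembered BOM, and the proposed getCharsetFor(Reader)):

```java
// Hedged sketch of "which encoding do I save with?" for an existing file.
public class SaveEncodingChooser {

    /**
     * @param userCharset    explicit user setting, or null (stands in for getCharset(false))
     * @param bomCharset     charset implied by a BOM the client wants to keep, or null
     * @param sniffedCharset what the proposed getCharsetFor(Reader) would return
     */
    public static String choose(String userCharset, String bomCharset, String sniffedCharset) {
        if (userCharset != null)
            return userCharset; // a hard setting by the user always wins
        if (bomCharset != null)
            return bomCharset; // otherwise honor the BOM; no API call needed
        return sniffedCharset; // otherwise fall back to getCharsetFor(contents)
    }

    public static void main(String[] args) {
        // no user setting, client wants to keep a UTF-16LE BOM
        System.out.println(choose(null, "UTF-16LE", "UTF-8")); // UTF-16LE
    }
}
```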
Sorry, I was busy shipping M5. Let me first comment on comment 16:
> Is there another one? I mean, will we need anything else from the content
> description other than the BOM property? Otherwise I would rather only take the
> BOM.
The idea of the new API is to hide (or better, provide) whatever happens in IFile.getCharset(). I only see a problem with relying on the BOM if IFile.getCharset() uses another (new) property from the content description in the future.
> This leads to my last question. I could not understand why a reader was not
> used instead.
> So, can the stream you handle to the content type manager API contain a BOM?
Yes, since it can also be a stream we get from a File (for external files which are not IFiles). Is it wrong to use the stream there? If so, what else should we use?
Does the API work if the file does not yet exist? Let me give some concrete scenarios:
- Save a new document into a folder F1 (which might have some encoding). I'd like to have an API where I can provide the path and the document/reader and get the charset back. If the API works for non-existent files, that will work; otherwise I'd need more API. Without the API I'd need to know that I have to ask the folder for the encoding.
- I have an existing file F1/f1.txt and save it as F1/f2.properties. Property files have a pre-defined encoding. Again, if I can use the new API for non-existent files it will work; otherwise I'd need API that can compute the new charset for me.
- Save an existing file F1/f1.xml with a user-defined encoding into F2. This is a use case where I would need to pass in the new content and the existing encoding of f1.xml (that's the other reason for having the content description; I forgot about that in comment 18, sorry). I could solve this manually by knowing that the user-defined charset wins, but the main idea behind the new API is that I don't need to know this.
Actually, the easiest for the Save As case would be if I could do the following:
1. newFile = IFolder.getFile(String)
2. newFile.setCharset(oldFile.getCharset(false))
3. newFile.getCharsetFor(Reader)
But I know that this is most likely not possible, since you do not want to store information for non-existent files. Hence I was thinking of the API sketched in comment 4. The improved version would look like:
???.getCharset(Reader, IFile, IContentDescription)
where
- the IFile does not need to exist
- the IContentDescription contains the charset and BOM of the original file (if any)
I think ???.getCharset(InputStream, String name, IContentDescription) is not needed.
Re: comment 18
> > This leads to my last question. I could not understand
> > why a reader was not used instead.
> > So, can the stream you handle to the content type manager
> > API contain a BOM?
> Yes, since it can also be a stream we get from File (for
> external files which are not IFile). Is it wrong to use the
> stream there? If so, what else should we use?
Once you have contents in text form, it does not make sense to keep the BOM in the contents. You might want to remember it somewhere else and insert it back when encoding the contents again during save (if you want to preserve it).
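The "remember it somewhere else" suggestion could look roughly like this JDK-only sketch, assuming UTF-8 for illustration; the class and method names are invented:

```java
import java.nio.charset.StandardCharsets;

// Sketch: keep the BOM out of the decoded text, and prepend it again when
// encoding on save (if the client wants to preserve it).
public class BomRoundTrip {

    public static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Drops a leading U+FEFF left over from decoding, if present. */
    public static String stripBom(String decoded) {
        return decoded.startsWith("\uFEFF") ? decoded.substring(1) : decoded;
    }

    /** Re-encodes the text, prepending the remembered BOM bytes. */
    public static byte[] encodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        String text = stripBom("\uFEFFhello"); // BOM removed from the text form
        System.out.println(text); // hello
        System.out.println(encodeWithBom(text).length); // 3 BOM bytes + 5 content bytes
    }
}
```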
Re: comment 19
> Does the API work if the file does not yet exist?
Yes.
> - save an existing file F1/f1.xml with user-defined encoding into F2
This is similar to what I describe in comment 17 for saving an existing file. The general rule is: if you have an encoding you want to use (you want to preserve the original BOM or a user-set encoding), just use it; don't call getCharsetFor(Reader). This is simple enough for clients to figure out by themselves. If you agree with that, I hope you also agree that IFile.getCharsetFor(Reader) should be enough for all of the use cases discussed here.
> Once you have contents in text form, it does not make sense to keep the BOM in
> the contents. You might want to remember it somewhere else to insert them back
> when encoding the contents again during save (if you want to preserve it).
Sure - we already do that. The fact that I can use the API on a non-existing IFile is perfect (I was always assuming it wouldn't be allowed) and fits my needs. The request to allow passing/setting the charset is not absolutely needed; it would have been easier for us to transfer it from an existing file to a non-existing one, but it's really not a must-have.
Fixed. Released to HEAD. Test case added to CharsetTest. Daniel, please let me know if you have any concerns/find any problems. If the API ends up not being useful, there will be no point in adding it.
Rafael, I filed bug 86545 to actually use the new API and provide feedback. Thanks for providing the API.