Community
Participate
Working Groups
Improve file encoding support. Eclipse 2.1 uses a single global file encoding setting for reading and writing files in the workspace. This is problematic when, for example, Java source files in the workspace use the OS default file encoding while XML files use UTF-8. The Platform should support non-uniform file encodings. [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: User experience]
*** Bug 36950 has been marked as a duplicate of this bug. ***
Original PR: bug 5399.
Re: Non-uniform file encodings in the Eclipse Platform Many worthwhile ideas here. Other comments... 1. I assume that in the "basic algorithm" the steps are performed in the order listed. In that case, steps 2 and 3 must be interchanged. The encoding interpreter must always be consulted first. Multiple encodings are possible with the same BOM. The result of (current) step 3 should be final. Otherwise, the BOM test should be ignored unless it is inconsistent with the result of step 4 or 5. 2. Encoding must be determined upon save as well as open. This determination may require calling an output encoding interpreter, which you do not have in your scheme. (Use case: User has an <?xml encoding declaration in an XML file and changes the text of the encoding attribute.) The editor should not be required to track these changes character-by-character and blast off encoding change notifications. In fact, the editor may not be aware of encoding at all. (Use case: Rick Jelliffe has proposed an encoding declaration that would appear in comments at the beginning of a file.) Instead, an output encoding interpreter should be called at save time. IOW, the "basic algorithm" should be applied at save time, too, using an encoding interpreter that operates on the Unicode text instead of a byte stream. 3. In light of the above, notifying of encoding changes seems of limited value, since the encoding may be re-determined at open/save time. Encoding should be discovered when it is needed. Notification may be counter-productive, leading editors to take actions they should not be taking, like calling setCharset(). 4. setEncoding() should be removed and the basic algorithm should be the description of how getEncoding() works. setEncoding() is a potential source of problems. For example, if setEncoding() is called on an open resource and the resource is then saved and closed, the resource cannot be re-opened successfully unless the encoding set is remembered. 
This makes it a resource property, but there is already a resource property that may contain an encoding, and the two may be in conflict. What is a valid use of setEncoding()? 5. It should be possible for an editor to have associated encoding interpreter(s), so that the user is not forced to set the encoding interpreter and the editor separately. It is highly likely that the user will not be aware of the encoding interpreter feature and will not correctly set it in advance of having encoding problems. In fact, users seem to have problems learning how to set editors associated with extensions, and they already know what an editor is. Likewise, editors should not have to establish their own encoding interpreters programmatically. 6. What is the use/purpose of isDefaultEncoding()? There may be several "defaults". If anyone cares that a resource is not using the workspace-level encoding, they should stop caring. 7. Workspace-level, resource-level and interpreter-determined are requirements, but I am not convinced there are use cases to support directory-level encoding, and they do add overhead. If the feature exists, someone may find it useful, if that's the threshold.
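The reordered "basic algorithm" argued for in point 1 — BOM first and final, then the content interpreter, then stored settings — could be sketched roughly as follows. All of the names here (EncodingResolver, the interpreter function, the parameters) are hypothetical illustrations, not platform API, and only the most common BOMs are checked:

```java
import java.util.Optional;
import java.util.function.Function;

class EncodingResolver {
    static String resolve(byte[] head,
                          Function<byte[], Optional<String>> interpreter, // e.g. parses an <?xml ...?> declaration
                          Optional<String> resourceProperty,
                          String workspaceDefault) {
        // 1. A BOM, if present, is treated as definitive (UTF-8 and UTF-16 only in this sketch).
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        // 2. Next, ask a content-sensitive interpreter to peek at the bytes.
        Optional<String> detected = interpreter.apply(head);
        if (detected.isPresent()) return detected.get();
        // 3. Fall back to an explicit per-resource setting, then the workspace default.
        return resourceProperty.orElse(workspaceDefault);
    }
}
```

The point of the ordering is that a stored property can go stale, while the BOM and the document content were written by the last writer.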
I'll give a couple of high-level comments, though not an exhaustive list (just wanted to document some main things first). 1. I agree with Bob's comment that the BOM step should be done first, but my memory of the standards is that that step should be final, if there is a BOM. That is, I thought the BOM was definitive. I'm not aware of cases where "Multiple encodings are possible with the same BOM". Bob, perhaps you can explain? 2. Encoding interpreters definitely need to be associated with content type, not file extension. That seems to be assumed and understood by everyone, but I just wanted to add my voice to the importance of that. 3. While an "output" interpreter would also be necessary, I think some token remembering the encoding used during input (or the last output) is required too. For one thing, if a 3-byte BOM (for UTF-8) is detected during read, it's only polite to maintain that when written. For a second thing, which does depend on the token being "kept up to date" with notifications, it is possible that someone can paste text into a document that "violates" the encoding. Some well-behaved editors might want to give some warning about that. 4. I might be explicit that the above comment assumes the EncodingMemento (my name for the token :) should be associated not only with the IFile/Resource, but also the IDocument. It is, after all, possible to save/copy a document independently of its original resource. In those cases the encoding token should ride along. 5. Something that seems missing from the spec is the association of IANA encoding names and Java encoding names. I suggest this be provided at a "base" level since 1) there are some ambiguities, 2) it's dependent on the VM and platform, and 3) there should be a "base" preference that allows users to control that association, when needed. (Most users don't need to do this, but some do, and for those that do, there's no alternative workaround). 6. 
I suspect there are a few well-known interpreters that should be included as part of the base support: XML, at least. HTML, JSP, CSS also come to mind. Others? 7. The spec also doesn't mention how conflict resolution between interpreters is handled. I suspect that if the "well known" ones were included as base support, there'd be little need beyond a warning message in the log file, but if, for example, everyone needs to re-invent an XML interpreter, there'd be plenty of opportunity for conflict and users may desire a choice as to which was used (which would be unfortunate). 8. Also not well covered in the spec is exception handling. From experience, it's easy for a file (e.g. an XML file) to specify an encoding while some character(s) in the file don't actually use that encoding, and a "MalformedInput" exception will be raised. Some mechanism is then needed to allow users and/or client code to "override" whatever the "default" behavior should be. For example, an editor might want to give the user a choice to "use default" or pick another encoding to try. 9. The above point reminds me that in Java 1.4 there's an encoding setting that allows different behavior on encoding errors during input. One option throws an exception, the other substitutes '?' for unreadable characters (well, actually, they say they make an attempt to "guess" what the character is, but I'm not sure that's very accurate). The point being that some parallel setting should exist with base Eclipse support, which would then "pass through" to the underlying Java support. There's a similar situation with encoding errors on output, but a little worse. Even with 1.3, invalid characters are written as '?' instead of automatically throwing an exception, and some care is needed to handle invalid characters on output. The degree of care, I suggest, should also be "settable" by client code.
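The Java 1.4 behavior described in point 9 is exposed through java.nio.charset.CharsetDecoder. A sketch of the two modes an Eclipse-level setting might "pass through" to (the class and method names below are mine, for illustration; the CharsetDecoder calls are real API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecodeModes {
    // Strict mode: a MalformedInputException is thrown on bad input.
    static String decodeStrict(byte[] bytes, String charsetName) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient mode: the replacement character is substituted for bad input.
    static String decodeLenient(byte[] bytes, String charsetName) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

An editor could call the strict variant first and, on exception, offer the user the lenient variant or a different encoding.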
David - This is a longish response, though I agree with most of your points. 1/3a. I have a general concern that Eclipse doesn't quite "get it" when it comes to plug-ins. In case after case, we find super-privileged plug-ins that are allowed to get in first and establish default behaviors by means that do not follow extension point rules and are difficult or impossible for other plug-ins to override. Much of 3.0 is an effort to correct this problem for menus and toolbars; the 2.1 cycle tried to do the same for keyboard shortcuts. Whenever I see some "default" behavior making irrevocable decisions without consulting plug-ins, I get nervous. For example, according to Unicode, the BOM is definitive and since the BOM was written by the last writer or agent, it probably reflects the current encoding. So it would be useful to see the encoding interpreter plug-in called with an argument that indicates the BOM encoding (or none), as well as the length of the BOM. There is no harm, either, in recording the BOM-determined encoding in a memento, provided the memento clearly indicates the origin of the information. To carry this information around as some sort of vague "preferred encoding" would be counter-productive. When a document is modified and saved, or created and saved for the first time, the encoding indicated within the document, if any, must always be consulted. The precedence for determining the write encoding should be: 1. In-document encoding, if any. 2. Resource property encoding, if any. 3. Directory default encoding(s), if any. 4. Platform encoding. If any of these is different from the input BOM encoding, the plug-in should ask the user to confirm or switch to the input encoding. The input encoding really has no privileged relationship to the output encoding, but there are two good reasons to ask: - One of the default encodings may lose information. 
- Users most often don't know what the platform default encoding is (at least, until they are screwed the first time). 2. Having done it this way in another product because I thought it was the right answer, I am keenly aware of two issues: - Users really don't understand content types and don't appreciate the added level of indirection (extension -> content type -> encoding interpreter/editor/whatever). Especially note the editor/whatever part. - If by content type you mean MIME type, it doesn't mean and doesn't uniquely identify "document type". E.g., text/xml and application/xml are the same document type with allegedly different encoding (but don't count on it). If you don't mean MIME type, then I don't know what you mean, but inventing a new document type naming convention is, as they say, fraught. 3a. Can't say that ignoring all of the user's encoding preferences and using instead an encoding the user can't see is "polite". Seems downright rude. 3b. The editor is going to dynamically track changes to the many encoding preferences, which requires resolving to the single relevant encoding preference each time, so that it can check the entire document and every document change between preference changes to make sure it is within that encoding? I don't think so. Or at least, I hope not. It might be valuable to check the document on save to ensure that information won't be lost - not needed very often, but certainly handy when it is, but that check can be non-trivial (and unfortunately is most useful when it is non-trivial) and is quite beyond the resources of most editor-writers. 4. I'm not sure that copying the contents of a document, which are encoding-neutral, has anything to do with how the copy is treated afterward. 5. I don't recall that Java is so perverse that it doesn't recognize any IANA names, nor that it gives a different interpretation to any. I certainly could be wrong, but that seems more like material for a bug report to Sun. 
What I do is present choices only in terms of IANA names but attempt to map any name provided first into a Java name, otherwise to an IANA name, using maps that are not case-sensitive and accommodate common variations in punctuation; otherwise, in a dialog, I will flag an error; if a possibly invalid name appears in a document, I try the name anyway. I don't consider it possible to predict what names Java will accept in any given release. Then what to do if Java rejects a name? I could probably do this better, but here's what I currently do: If Java rejects a user-specified encoding on save, I save in UTF-8 and depend on the platform to write a BOM. If the platform co-operates, then no information is lost, the document is always readable and save always succeeds. I felt that save succeeding was more important than user notification (or trying to choose another preference, which might again result in user notification). To allow a user to do a Save All and walk away without seeing some dialog pop up, and later experience a power failure, seemed to me to be a hanging offense. That said, some user notification after the fact, and some assurance that the BOM is actually written, would improve matters. 6/7. All encoding interpreters should be treated equally; the platform-contributed ones should be just like any other. There needs to be a resolution algorithm, which was the source of my previous comment that if a user selects an editor for a document and the editor contributes an encoding interpreter, that interpreter ought to be used. It is really asking a bit much for users to deal with this as a separate issue, and I don't care that files can be opened by plug-ins that are not editors, this use case is important enough to get special treatment. As always, the resolution algorithm should take into account the important use cases. Open in editor and Search come readily to mind. 
(Maybe Search is simple: the contents should be obtained from already open documents, if possible, otherwise apply the algorithm.) 8. I agree for Open in editor the user should be given a choice on input if a specified encoding throws; I'm not so sure for Search; probably this should just go into the status messages and let the search continue without the file. Use cases rule. 9. For encoding, the default should be try/catch/recover for both input and output. Font/code page selection is also involved. The most common report I see is that the file is read correctly but the user sees boxes or garbage, and that's often a presentation problem. The current one-font-fits-all behavior doesn't really cut it, any more than one-encoding-fits-all does. It seems obvious that the correct presentation can, and should, be selected based on the encoding if it is, say, ASCII or SHIFT-JIS, but not obvious if it is UTF-8. But all this is platform-dependent and not one of my areas of expertise.
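The write-time precedence proposed in the comment above (in-document encoding, then resource property, then directory default, then platform) amounts to a simple fall-through. A minimal sketch, assuming hypothetical names — none of these exist in the platform, and the Optionals stand in for the various settings:

```java
import java.util.Optional;

class WriteEncodingChooser {
    static String choose(Optional<String> inDocumentEncoding,   // e.g. from an <?xml ...?> declaration
                         Optional<String> resourceProperty,     // per-file setting
                         Optional<String> directoryDefault,     // per-folder setting
                         String platformDefault) {
        if (inDocumentEncoding.isPresent()) return inDocumentEncoding.get();
        if (resourceProperty.isPresent()) return resourceProperty.get();
        if (directoryDefault.isPresent()) return directoryDefault.get();
        return platformDefault;
    }
}
```

Per the proposal, a mismatch between the chosen encoding and the input BOM encoding would then trigger a user confirmation, which this sketch omits.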
Actually, WRT 3b, the determination is trivial enough if you try the translation and it throws.
Thought I should add some comments on Bob's comments. Hopefully, we'll approach some clear statements of principles or use cases, and then we could decide if new encoding support can/will support those use cases. First, on 3b. ... I hate to be thought of as rude :) so I'll clarify what I originally meant ... given a resource was read in as UTF8 (and it had a 3 byte BOM) then if the resource is written as UTF8, it's only polite to also include the 3 byte BOM when written. But, if read in as UTF8 (and it did not have a 3 byte BOM) then if the resource is written as UTF8, it is only appropriate to not include the 3 byte BOM. And, my point was, the only way to know what to do about the 3 byte BOM is to "carry along" that "how read" info, such as in an EncodingMemento. At more of a "principle" level, I do think it important to carry along the whole encoding info that a resource was read with. And, I think there should be a rule that says "in the absence of over-riding information, a resource should be written as it was read." The "over-riding" part would be rules 1 and 2 in Bob's list, so I guess this would be a "rule 2.1", coming before "directory setting". The "use cases" in support of this principle are easy to find: a. If a resource is read in with Unicode encoding due to some BOM (3 or 4 (or more?) bytes) AND it is not otherwise specified, then I think it should be written out the same way. I think that's the intent of the Unicode standard BOM, though I don't know in practice how many people use this technique (since most 'modern files' would contain the encoding in the file itself, e.g. for XML, etc.). (Guess it would apply to Java files! :) b. 
Another use case I've personally seen: for HTML files (which are not so standard on encoding specs), there are some Japanese systems which "peek" inside the file to determine encoding (peek in the sense of looking at byte patterns), and if one of those Japanese encodings is found, those users would expect (require!) it to be written the same way. And this use case above is the reason why it's important to "carry along" the encoding info through to the IDocument (or similar), and not just resources, so that if a resource from one of the above cases results in an IDocument, and that IDocument is cloned/copied/savedAs ... some other resource, then it would have the same characteristics as the original. Just a quick note on the importance of associating encoding rules with "content type" ... the prima donna use case for this is JSPs. These typically use a file extension of .jsp, but there's no reason why a user can't change the settings on their web server and want 'jst' to also be interpreted as a JSP file. They should then have an easy way to let the tool/platform know that 'jst' should be in the JSP "family" (is that a better term?) and then have everything in the platform that in some way handles .jsp files handle .jst files in the same way (not just editor association). Lastly, and what may be a different view between Bob and me, is that I do see encoding/decoding only working correctly, from a platform point of view, if the resolution algorithm always results in the same interpreter/rule being used to do encoding/decoding. The reason for this is that there are so many functions (compilers, builders, validators, databases, search indices, fixups, code generators, etc.) that all depend on the resource being "interpreted" the same way. Of course, this should not preclude some functions (e.g. editors) from changing the encoding/decoding, but I think this should be a separate "concept" from the "interpreter resolution". 
In fact, if I can end with a "wild idea" off the top of my head, maybe even editors should always assume the same encoding/decoding, but there be a convenient "encoding explorer" as part of the platform that would give an easy way to view files, determine effects of changing encodings (and fonts!), and save new interpretation. Well, just a thought. Thanks, hope these ramblings are clear enough so they can be distilled to the use-case or "encoding principle" level. (Glad I'm not writing the spec :)
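The BOM round-trip argument in the two comments above can be made concrete. A minimal sketch of the "EncodingMemento" idea — remember at read time whether a UTF-8 BOM was present, so the writer can reproduce it faithfully; the class and method names are illustrative, not part of any Eclipse API, and only the UTF-8 case is handled:

```java
import java.nio.charset.Charset;

class EncodingMemento {
    final String charset;
    final boolean hadUtf8Bom;

    EncodingMemento(String charset, boolean hadUtf8Bom) {
        this.charset = charset;
        this.hadUtf8Bom = hadUtf8Bom;
    }

    // Record how the bytes were read...
    static EncodingMemento fromBytes(byte[] bytes) {
        boolean bom = bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF;
        return new EncodingMemento("UTF-8", bom);
    }

    // ...and write them back the same way: BOM out only if a BOM came in.
    byte[] encode(String text) {
        byte[] body = text.getBytes(Charset.forName(charset));
        if (!hadUtf8Bom) return body;
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF; out[1] = (byte) 0xBB; out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }
}
```

A bare charset string cannot express this distinction, which is the core of the argument for a memento over a string.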
See bug 3970 for a request to allow configurable line delimiters. If we allow this, it should be supported at the same granularity as the encoding settings (see Kai's comment).
A new revision of the improved file encoding proposal has been made available off the Platform/Core web page: http://dev.eclipse.org/viewcvs/index.cgi/%7Echeckout%7E/platform-core-home/dev.html#plan_current Comments are welcome and should be made on the Platform/Core development list (platform-core-dev@eclipse.org) or this PR.
While I appreciate the desire to simplify usage, I feel the suggested new plan (per-project encodings, with some attempt to discover encodings automatically) is a step backwards from the previous one (per-project, folder, and file encoding settings). First, there are good reasons to have documents with multiple encodings in a single project. For example, suppose I want to ship a product on Windows, Mac, and UNIX. I have README or other files with some requirements for non-ASCII characters. For Windows, I might want to use the Windows variant of Latin-1, CP 1252. For Mac, I might want to use Mac Roman. For UNIX, I might use ISO Latin-1. I might also have Japanese variants in Shift-JIS, Chinese in Big-5, and other Asian variants. These variants are necessary, because a large number of sites do not have Unicode system locales - many do not even have Unicode locales installed. Second, note that discovering an encoding by reading a file is difficult, and potentially expensive (for files not marked with Unicode BOMs). How much of a file would one read before deciding one knew the encoding? What if my README file contained 2K or more of 7-bit ASCII text before a section that contained some symbols, or some accented or Asian characters? Third, the proposal seems to assume that certain file types imply certain encodings - notably, that XML should be in UTF-8. This is not always the case. There are times when we want to use Latin-1 or even 7-bit ASCII for XML (using XML character references for all characters outside those ranges), for compatibility with transport mechanisms and older code that cannot handle UTF-8. As an extreme case, consider sending an XML document through a legacy protocol that is not 8-bit clean.
Sorry for the long 'comment' ... this is the description of a contribution I'd like to propose, to see if it's helpful. It represents (after much refactoring) some code that we've been using to do encoding/decoding on previous products based on the Eclipse 2.x stream. I've tried to keep it similar to the current specs and proposals I've read, but I suspect much work still needs to be done there, so this is something of a "stand alone" version. Extensible Content Sensitive Encoding Contact: David M. Williams david_williams@us.ibm.com 919-254-0362 The attached contribution (to follow) provides the ability to "peek" inside a file to determine its appropriate decoding. This is required for files for which there are common and "industry standard" rules ... such as Unicode streams, XML, JSP, DTDs, HTML and CSS -- and this 'readme' file is contained in the zip file, as the package.html file under the primary project, com.ibm.encoding.resource It complements some of the work that's been going on with the Platform and Text teams, since I don't think the "detector" part of that spec has been implemented, so I hope this code can save them some effort, as well as provide a good "test case" if the ideas really work. This contribution focuses entirely on the content-sensitive part of the requirements ... there are other cases and other file types for which the content does not even give a hint as to the encoding. There are "hooks" in my code where that work can be tied in so that, when appropriate, the algorithms can go and look up the settings according to user settings. For anyone taking a look at this code, here's an outline of the places to start ... the primary packages and classes. com.ibm.encoding.resource CodedReaderCreator -- creates a Reader with the correct encoding set to read characters. CodedStreamCreator -- creates a ByteOutputStream, the bytes of which are correctly encoded for storing. 
com.ibm.encoding.resource.contentspecific contains "detectors" and infrastructure for XML, JSPs, HTML, CSS, and DTDs. The infrastructure simply means the mechanisms to associate the right encoding rules (detectors) with the right content type. This infrastructure makes use of "contentTypeIdentifiers" which I contributed as an append to another bugzilla, but have re-included in this package for convenience. This means it should work even for .project files which contain NL characters. NOTE: the base Eclipse team has said they might provide one for XML, and I'm sure there might be hesitancy to include the others, for JSP, HTML, CSS, and DTDs, but I felt obligated to include them for two reasons: 1) just in case there is a desire (from the community or others) to include them, and 2) you can't adequately test the design with just one case, so they will be helpful at least for that. While there are many unit tests which might prove helpful in understanding and verifying the code, it's really hard to "see" and appreciate the results without an editor, or some other way to visually confirm the correct characters are there. How to use with an editor: An outline of how to change the basic text editor (via file buffers) to have the content read and written correctly. Given the following sorts of changes, the basic text editor can open XML, JSP, HTML, CSS, and DTD files no matter what their "internal" encoding is (well, as long as it's fairly well formed, and is supported by the VM). [And, yes, you heard right, that's the same editor; the editor shouldn't have much to do with encoding/decoding, since that's a "model level" responsibility.]

In ResourceTextFileBuffer.commitFileBufferContent:

    CodedStreamCreator codedStreamCreator = new CodedStreamCreator();
    codedStreamCreator.set(fFile.getName(), fDocument.get());
    ByteArrayOutputStream byteStream = codedStreamCreator.getCodedByteArrayOutputStream();
    InputStream stream = new ByteArrayInputStream(byteStream.toByteArray());
    //InputStream stream= new ByteArrayInputStream(fDocument.get().getBytes(encoding));

In initializeFileBufferContent:

    CodedReaderCreator codedReaderCreator = new CodedReaderCreator(fFile);
    fEncoding = codedReaderCreator.getEncodingMemento().getJavaCharsetName();
    //fEncoding= fFile.getPersistentProperty(ENCODING_KEY);

There are actually simpler ways than the above code indicates, but they would require more modification of the existing code, so I fit into what's there as easily as possible. (Also, I might note, these simple changes don't begin to cover error conditions, or eliminate incorrect messages. They do not update the file property or file states, or anything like that.) Note, the code as contributed was based on I20040304. I know of some bugs still in it, and want to do more cleanup and documentation, but have finally gotten it to the point that I think others could study it and give any comments they'd find constructive. Possible Issues: Maybe it's the state of the current code, but it seems lots of objects need to know and have the file encoding set and synchronized, all at just the right times. This seems very confusing to me. I've tried to "centralize" all the encoding "intelligence" and algorithms in just a few classes. AND NOTE: these current classes could be left separate or encapsulated under IFile/IStorage, but I am not sure of the advantage of that, exactly ... maybe I've just gotten too used to thinking of IFile and IStorage as binary-providing objects. I've argued elsewhere, and still believe, that a simple 'charset' string is not enough to know how to re-write a file. In many well-known cases, it depends on how the file was read. For example, a file can be UTF-8 with or without the 3-byte BOM. So, I advocate a simple "encodingMemento" be available that can provide detailed information about how a stream was decoded, and in a good system, this information would influence how it was later encoded again. 
Even if the base didn't want to support such elaborate strategies, if getCharsetMemento() returned a memento with one method, getCharset(), then this would seem to leave the way open for others to implement more elaborate strategies as required by their products. ContentTypeIdentifier: I haven't heard any direct feedback about my original proposal, but have seen discussions of even more complicated collections of arbitrary information about the contents of a file. That might be a nice thing to have, and might work, but it's hard for me to envision how that works without a low-level identification first. There's a certain order that, it seems to me, is required. It might be summarized as: before you can determine complex type information, you need to know how to decode the file, and before you can decode the file you need to know what type the file is [and sometimes the file's extension is not enough to determine its type]. Maybe I'm over-reacting, but I fear that if there's lots of different, uncoordinated information about files that all needs to be saved and modified, all in just the right order, Eclipse will end up appearing like some "proprietary" IDE ... meaning that things work fine when in Eclipse, but do not work fine when exported. As a simple example, if a user (or program) sets a document's encoding to UTF-16, but due to project settings, or some timing problem, it actually gets saved as UTF-8, then that file would be unreadable. I prefer the central object that knows, rather than many objects that all have to stay in synch. So ... just a "possible issue".
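To illustrate the kind of content-sensitive "detector" the contribution describes, here is a rough sketch for the XML case only: peek at the first bytes for an encoding pseudo-attribute in the XML declaration. The class name is hypothetical (it is not the contributed CodedReaderCreator), and real detectors must also handle BOMs and UTF-16 input before assuming an ASCII-compatible prefix:

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEncodingDetector {
    private static final Pattern DECL =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    static Optional<String> detect(byte[] head) {
        // The declaration itself is ASCII-compatible, so a Latin-1 view is safe for this sketch.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(prefix);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The same shape — a per-content-type peek returning an optional charset name — generalizes to the JSP, HTML, CSS, and DTD detectors the contribution mentions.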
Created attachment 8384 [details] Proposed Contribution to handle encoding based on a file's content This zip file contains 7 projects, but only one is the 'primary' one, com.ibm.encoding.resource. I20040304 is the last build I've looked at, so hope I'm not way behind :) and hope some use can be made of this contribution. Thanks.
New file encoding support is now available in the latest builds. Moving to Rafael for comment/closure.
I've looked at I200404131323. Compile, run and debug look fine, but I got errors in the Compare and Search functions. I've tried the following scenario. 1-create a java program on RHEL 3.0WS Japanese locale (workbench default encoding is EUC-JP). 2-create a java project on Windows 2003 (workbench default encoding is MS932), and change it to EUC-JP at the project properties > info page 3-import the java programs into the project 4-try run, debug, search, edit and compare from local history. Run and debug look fine. Search with Japanese text pops up an error; the .log is attached below. Compare indicated unmodified Japanese text since the Japanese text is garbled in the original file; it looks like it's encoded by EUC-JP. !SESSION 4 20, 2004 19:16:10.500 ----------------------------------------------- java.fullversion=J2RE 1.4.2 IBM Windows 32 build cndev-20040322 (JIT enabled: jitc) BootLoader constants: OS=win32, ARCH=x86, WS=win32, NL=ja_JP !ENTRY org.eclipse.core.runtime 4 2 4 20, 2004 19:16:10.516 !MESSAGE An internal error occurred during: "Search for References". !STACK 0 java.lang.NullPointerException at org.eclipse.search2.internal.ui.SearchView.queryFinished (SearchView.java:449) at org.eclipse.search2.internal.ui.QueryManager.fireFinished (QueryManager.java:108) at org.eclipse.search2.internal.ui.QueryManager.queryFinished (QueryManager.java:126) at org.eclipse.search2.internal.ui.InternalSearchUI.searchJobFinished (InternalSearchUI.java:151) at org.eclipse.search2.internal.ui.InternalSearchUI.access$1 (InternalSearchUI.java:149) at org.eclipse.search2.internal.ui.InternalSearchUI$InternalSearchJob.run (InternalSearchUI.java:133) at org.eclipse.core.internal.jobs.Worker.run(Worker.java:62)
Created attachment 9688 [details] sample java program encoded by EUC-JP
Created attachment 9689 [details] screen shot of Compare
Please file separate bug reports against the Search and Compare components and list detailed steps for how to reproduce the problem.
ok, I've just filed bug 59228 for Search and bug 59232 for Compare problem.
Plan items must target release.
Many of the issues raised in this bug or on the platform-core dev list were addressed (at least at the Core level). The encoding doc will be updated accordingly soon. Please open a separate bug against Platform/Core for any encoding-related issues that may arise. Interesting starting points: - IContentTypeManager#getDescriptionFor - IEncodedStorage#getCharset - IFile#getCharset([boolean]) - IFile#getContentDescription - IContentDescription#BYTE_ORDER_MARK/CHARSET Thanks for the great feedback.
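A hedged sketch of how the starting points listed above fit together for clients (Eclipse 3.0 API; `file` is assumed to be an IFile, and error handling is omitted):

```java
// Ask the file for its charset; with the boolean argument set to true,
// this falls back through the content description, parent and project
// settings, and finally the workspace default.
String charset = file.getCharset(true);

// Or inspect the content description directly, e.g. for the byte order mark:
IContentDescription description = file.getContentDescription();
Object bom = description == null
        ? null : description.getProperty(IContentDescription.BYTE_ORDER_MARK);
```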
Closing.