Bug 327316 - Autodetect encoding of Files in Eclipse
Status: NEW
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources
Version: 3.6
Hardware: All, OS: All
Importance: P3 enhancement
Target Milestone: ---
Assignee: Platform-Resources-Inbox CLA
Reported: 2010-10-08 05:14 EDT by Krzysztof Kazmierczyk CLA
Modified: 2017-11-26 19:05 EST

Attachments
. (3.98 KB, application/octet-stream)
2017-11-26 18:12 EST, Tobias Kastan CLA

Description Krzysztof Kazmierczyk CLA 2010-10-08 05:14:31 EDT
Build Identifier: I20100608-0911

Currently the encoding of a file in Eclipse can only be set manually. We could add an option to guess the file's encoding automatically.

Reproducible: Always
Comment 1 Szymon Brandys CLA 2010-10-08 05:31:22 EDT
So far we can recognize UTF encodings based on BOF, however I would like to hear how to make it more general.
Comment 2 Krzysztof Kazmierczyk CLA 2010-10-08 06:08:37 EDT
The idea is to detect the encoding from the file content. The file is just a stream of bytes.
Some bytes, for example, do not occur in ISO encodings. We can also determine whether a given file is valid UTF-8 (there is a document that specifies the UTF-8 format). You can eliminate the invalid encodings and then present the remaining ones as a list of suggestions for the user to choose from.
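
The elimination idea could be sketched roughly as follows. This is only an illustration using the JDK's strict charset decoders, not existing Platform code; the class and method names (EncodingCandidates, decodesCleanly, plausible) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Eclipse: filters a candidate list of
// charsets down to those under which the bytes decode without errors.
class EncodingCandidates {

    // True if every byte sequence is legal in the given charset.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Keeps the candidates that survive elimination, in their original order.
    static List<Charset> plausible(byte[] bytes, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset candidate : candidates) {
            if (decodesCleanly(bytes, candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Note that this kind of check can only rule encodings out. ISO-8859-1, for instance, assigns a character to every byte value, so Java's decoder never rejects it; the user would still have to pick from the remaining suggestions.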
Comment 3 Szymon Brandys CLA 2010-10-08 06:13:10 EDT
(In reply to comment #1)
> So far we can recognize UTF encodings based on BOF,
Of course I meant BOM.
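
For reference, BOM-based recognition amounts to comparing the first bytes of the file against a few fixed signatures. The following is a minimal sketch of that technique, not the Platform's actual implementation; BomSniffer and charsetFromBom are hypothetical names, and UTF-32 signatures are deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Eclipse: maps a leading byte order mark
// to a charset. UTF-32 is ignored in this sketch.
class BomSniffer {

    // Returns the charset implied by the BOM, or null if no BOM is present.
    static Charset charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the unsigned values of the leading bytes with the signature.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A file without a BOM yields null here, which is exactly the case where content-based guessing would have to take over.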
Comment 4 Krzysztof Kazmierczyk CLA 2010-10-08 06:17:36 EDT
(In reply to comment #1)
> So far we can recognize UTF encodings based on BOF, however I would like to
> hear how to make it more general.

To clarify: we would also like to detect UTF even if there is no BOM.
Comment 5 John Arthorne CLA 2010-10-08 09:44:51 EDT
Don't we already have all this? We have a pluggable mechanism where people can provide IContentDescriber objects that read the bytes from the file and determine the encoding and content type. There are several of these built in, such as XMLContentDescriber that reads the first bytes of an XML file in order to determine encoding and content type. All of this is used to determine the content type for the common case where the user has not specified it manually.
Comment 6 Szymon Brandys CLA 2010-10-08 10:19:15 EDT
(In reply to comment #5)
Well, I guess that Krzysztof K. is familiar with the content types and content describers mechanism and the request is to improve encoding recognition in Platform. So far we recognize only XML files based on BOM or the header.
Comment 7 Krzysztof Kazmierczyk CLA 2010-10-11 11:33:46 EDT
(In reply to comment #5)
> Don't we already have all this? We have a pluggable mechanism where people can
> provide IContentDescriber objects that read the bytes from the file and
> determine the encoding and content type. There are several of these built in,
> such as XMLContentDescriber that reads the first bytes of an XML file in order
> to determine encoding and content type. All of this is used to determine the
> content type for the common case where the user has not specified it manually.

John, it is great that we have that mechanism. It will probably help us deliver exactly the solution I want.

Here are sample steps to reproduce my issue:
1. Outside Eclipse, create a file with UTF-8 encoding. This is the sample content of the file test.txt: 
Nażółć gęślą jaźń\EOF
2. Create a new Eclipse project with the default encoding set to ISO-8859-1
3. Copy the file from step 1 to that project
4. Open the file
What is:
The file is opened with ISO-8859-1 encoding and shows strange characters instead of the Polish ones.
What we want:
Autodetect this file as a UTF-8 text file and display it with Polish characters in the editor.

John, does that help?
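
The effect in the steps above can be reproduced outside Eclipse with plain Java. This is only a sketch of what happens to the bytes; MojibakeDemo and its members are hypothetical names, and the sample text is written with Unicode escapes so the source compiles regardless of the compiler's own encoding setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not Eclipse code: the same bytes read with two charsets.
class MojibakeDemo {

    // "Nażółć gęślą jaźń" written with Unicode escapes.
    static final String SAMPLE =
            "Na\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 ja\u017A\u0144";

    // The bytes as an external editor would store them in UTF-8.
    static byte[] bytesOnDisk() {
        return SAMPLE.getBytes(StandardCharsets.UTF_8);
    }

    // What an editor shows when it decodes the file with the given charset.
    static String openWith(Charset charset) {
        return new String(bytesOnDisk(), charset);
    }

    public static void main(String[] args) {
        // Project default ISO-8859-1: each non-ASCII letter becomes two characters.
        System.out.println(openWith(StandardCharsets.ISO_8859_1));
        // Correct charset: the original Polish text comes back.
        System.out.println(openWith(StandardCharsets.UTF_8));
    }
}
```

Each of the nine Polish letters in the sample takes two bytes in UTF-8, so the ISO-8859-1 reading is nine characters longer than the original text.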
Comment 8 Szymon Brandys CLA 2010-10-11 12:58:00 EDT
(In reply to comment #7)
> Here are sample steps to reproduce my issue:
> 1. Create outside Eclipse file with UTF-8 encoding. This is sample content of
> the file test.txt: 
> Nażółć gęślą jaźń\EOF
> 2. Create new Eclipse project with default encoding set to ISO-8859-1
> 3. Copy file from 1 to that project
> 4. Open file
> What is:
> The file has been opened with ISO-8859-1 encoding and contains different
> strange characters instead of Polish ones.
> What we want:
> Autodetect this files as UTF-8 text file and display it with polish characters
> in editor.

It seems that you want to recognize the encoding based on the content for text files and/or XML files? This could be implemented in TextContentDescriber; however, it could affect performance. Each file change would trigger an operation that checks the whole file content to determine its encoding.

And of course content types that don't extend TextContentDescriber would not recognize the encoding anyway.

For me a better approach is to have a tool (in your product) that scans the content of the selected file, determines its encoding and sets the charset on the file.

What do you think?
Comment 9 Krzysztof Kazmierczyk CLA 2010-10-12 02:51:03 EDT
> For me a better approach is to have a tool (in your product) that scans the
> content of the selected file, determines its encoding and sets the charset on
> the file.
It is a good approach because it addresses the performance concerns and lets me correct the encoding whenever I want, but do you think such a tool would be useful in Eclipse itself, not only in products that implement such functionality?
Comment 10 Shinji Kashihara CLA 2016-08-12 21:20:33 EDT
Autodetect Encoding Plugin as a workaround.
http://marketplace.eclipse.org/content/autodetect-encoding
Comment 11 Tobias Kastan CLA 2017-11-26 18:12:08 EST
Created attachment 271648 [details]
.
Comment 12 Tobias Kastan CLA 2017-11-26 19:05:39 EST
Comment on attachment 271648 [details]
.

did not mean to upload this, can someone delete? Thanks