327316 – Autodetect encoding of Files in Eclipse

Status:	NEW

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Resources
Version:	3.6
Hardware:	All All

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Platform-Resources-Inbox
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-10-08 05:14 EDT by Krzysztof Kazmierczyk
Modified:	2017-11-26 19:05 EST
CC List:	6 users

See Also:

Attachments
. (3.98 KB, application/octet-stream) 2017-11-26 18:12 EST, Tobias Kastan	no flags	Details

Description Krzysztof Kazmierczyk

2010-10-08 05:14:31 EDT

Build Identifier: I20100608-0911

Currently, the encoding of files in Eclipse can only be set manually. We could add an option to guess a file's encoding automatically.

Reproducible: Always

Comment 1 Szymon Brandys

2010-10-08 05:31:22 EDT

So far we can recognize UTF encodings based on BOF, however I would like to hear how to make it more general.

Comment 2 Krzysztof Kazmierczyk

2010-10-08 06:08:37 EDT

The idea is to detect the encoding from the file content. The file is just a stream of bytes.
Some bytes, for example, do not occur in ISO encodings. We can also determine whether a given file is valid UTF-8 (there is a document that specifies the UTF format). You can eliminate the invalid encodings and then offer the user a list of suggested encodings to choose from.
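
The elimination idea can be sketched with plain JDK classes. This is only an illustration under assumptions, not existing Eclipse code: the class name CharsetCandidates and both method names are hypothetical. A strict decoder rejects byte sequences that are invalid in a given charset, so every candidate that fails can be dropped from the suggestion list.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Eclipse Platform API.
final class CharsetCandidates {

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one byte sequence is invalid in this charset
        }
    }

    /** Keeps only the candidates under which the bytes are valid, in the given order. */
    static List<Charset> plausible(byte[] bytes, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset candidate : candidates) {
            if (decodesCleanly(bytes, candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Note that Java's ISO-8859-1 decoder accepts every byte value, so this check alone can never rule it out. That is one reason the remaining candidates would still have to be offered to the user rather than applied silently.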

Comment 3 Szymon Brandys

2010-10-08 06:13:10 EDT

(In reply to comment #1)
> So far we can recognize UTF encodings based on BOF,
Of course I meant BOM.
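
For reference, BOM-based recognition amounts to comparing the first bytes of a file against a few known signatures. A minimal standalone sketch follows; the class and method names are hypothetical, and it is not the Platform's actual implementation.

```java
// Hypothetical helper, not part of the Eclipse Platform API.
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading byte order mark, or null if none is found.
     * The UTF-32LE mark (FF FE 00 00) also starts with FF FE; this sketch ignores UTF-32.
     */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```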

Comment 4 Krzysztof Kazmierczyk

2010-10-08 06:17:36 EDT

(In reply to comment #1)
> So far we can recognize UTF encodings based on BOF, however I would like to
> hear how to make it more general.

To clarify: we would also like to detect UTF even if there is no BOM.

Comment 5 John Arthorne

2010-10-08 09:44:51 EDT

Don't we already have all this? We have a pluggable mechanism where people can provide IContentDescriber objects that read the bytes from the file and determine the encoding and content type. There are several of these built in, such as XMLContentDescriber that reads the first bytes of an XML file in order to determine encoding and content type. All of this is used to determine the content type for the common case where the user has not specified it manually.

Comment 6 Szymon Brandys

2010-10-08 10:19:15 EDT

(In reply to comment #5)
Well, I guess that Krzysztof K. is familiar with the content types and content describers mechanism and the request is to improve encoding recognition in Platform. So far we recognize only XML files based on BOM or the header.

Comment 7 Krzysztof Kazmierczyk

2010-10-11 11:33:46 EDT

(In reply to comment #5)
> Don't we already have all this? We have a pluggable mechanism where people can
> provide IContentDescriber objects that read the bytes from the file and
> determine the encoding and content type. There are several of these built in,
> such as XMLContentDescriber that reads the first bytes of an XML file in order
> to determine encoding and content type. All of this is used to determine the
> content type for the common case where the user has not specified it manually.

John, it is great that we have that mechanism. It will probably help us deliver exactly the solution I want.

Here are sample steps to reproduce my issue:
1. Create a UTF-8 encoded file outside Eclipse. This is sample content of the file test.txt:
Nażółć gęślą jaźń\EOF
2. Create a new Eclipse project with the default encoding set to ISO-8859-1
3. Copy the file from step 1 to that project
4. Open the file
What is:
The file is opened with ISO-8859-1 encoding and shows strange characters instead of the Polish ones.
What we want:
Autodetect this file as a UTF-8 text file and display it with Polish characters in the editor.
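
The effect in the "What is" step can be reproduced outside Eclipse: the file's bytes are UTF-8, but they are decoded as ISO-8859-1, so each multi-byte character turns into two or more unrelated characters. A small sketch, with a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not Eclipse code.
final class MojibakeDemo {

    /** Encodes the text as UTF-8 and decodes those bytes as ISO-8859-1, as in the steps above. */
    static String misread(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Reverses misread(): recovers the bytes via ISO-8859-1 and decodes them as UTF-8. */
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }
}
```

Because ISO-8859-1 maps every byte to exactly one character, the misreading loses no information, which is why reopening the same file as UTF-8 shows the correct text.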

John, does that help?

Comment 8 Szymon Brandys

2010-10-11 12:58:00 EDT

(In reply to comment #7)
> Here are sample steps to reproduce my issue:
> 1. Create outside Eclipse file with UTF-8 encoding. This is sample content of
> the file test.txt: 
> Nażółć gęślą jaźń\EOF
> 2. Create new Eclipse project with default encoding set to ISO-8859-1
> 3. Copy file from 1 to that project
> 4. Open file
> What is:
> The file has been opened with ISO-8859-1 encoding and contains different
> strange characters instead of Polish ones.
> What we want:
> Autodetect this files as UTF-8 text file and display it with polish characters
> in editor.

It seems that you want to recognize the encoding based on the content for text files and/or XML files? This could be implemented in TextContentDescriber; however, it could affect performance. Each file change would trigger an operation that checks the whole file content to determine its encoding.
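
One way to bound that cost, sketched here as an assumption rather than a proposal for TextContentDescriber, is to validate only a fixed-size prefix of the file. The class and method names are hypothetical. The decoder is told that more input may follow, so a multi-byte sequence cut off at the end of the prefix is not counted as an error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Eclipse Platform API.
final class Utf8PrefixCheck {

    /**
     * Returns true if every complete byte sequence in the prefix is valid UTF-8.
     * The caller is assumed to pass only the first N bytes of the file.
     */
    static boolean prefixLooksLikeUtf8(byte[] prefix) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(prefix);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = false: a truncated trailing sequence yields UNDERFLOW, not an error.
            CoderResult result = decoder.decode(in, out, false);
            if (result.isError()) {
                return false;
            }
            if (result.isUnderflow()) {
                return true;
            }
            out.clear(); // OVERFLOW: discard the decoded characters and keep going
        }
    }
}
```

The trade-off is accuracy: a file whose first non-ASCII bytes appear after the prefix passes this check under almost any ASCII-compatible encoding, so the result is a hint, not proof.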

And of course content types that don't extend TextContentDescriber would not recognize encoding anyway.

For me a better approach is to have a tool (in your product) that scans the content of the selected file, determines its encoding and sets the charset on the file.

What do you think?

Comment 9 Krzysztof Kazmierczyk

2010-10-12 02:51:03 EDT

> For me a better approach is to have a tool (in your product) that scans the
> content of the selected file, determines its encoding and sets the charset on
> the file.
That is a good approach because it addresses the performance concerns and lets me correct the encoding whenever I want, but do you think such a tool would be useful in Eclipse itself, not only in products that implement this functionality?

Comment 10 Shinji Kashihara

2016-08-12 21:20:33 EDT

Autodetect Encoding Plugin as a workaround.
http://marketplace.eclipse.org/content/autodetect-encoding

Comment 11 Tobias Kastan

2017-11-26 18:12:08 EST

Created attachment 271648 [details]
.

Comment 12 Tobias Kastan

2017-11-26 19:05:39 EST

Comment on attachment 271648 [details]
.

did not mean to upload this, can someone delete? Thanks