547230 – Saving text file corrupts non-ASCII characters if encoding mismatch

Status:	REOPENED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Text
Version:	4.11
Hardware:	All All

Importance:	P3 major
Target Milestone:	---
Assignee:	Platform-Text-Inbox
QA Contact:

URL:
Whiteboard:	stalebug
Keywords:

Depends on:
Blocks:

Reported:	2019-05-13 12:47 EDT by David Balažic
Modified:	2021-05-05 07:27 EDT
CC List:	2 users

See Also:	547228

Description David Balažic

2019-05-13 12:47:05 EDT

 - have a file in some project encoded in CP1250, containing non-ASCII characters such as č
 - in Eclipse, set the encoding for that file to UTF-8 (this often happens when working with legacy files while the Eclipse default is UTF-8)
 - open the file in Eclipse (double-click it in Project Explorer to open a Text Editor)
 - make some change, such as adding a space character
 - save: Ctrl+S

Result:
the non-ASCII characters in the file are corrupted (each is replaced with some multi-byte code)
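
The corruption can be reproduced outside Eclipse with a short Python sketch. It assumes the editor decodes the file with the configured charset and substitutes U+FFFD for bytes it cannot decode, which is how Java's stream decoders behave by default. Whether the Eclipse text editor does exactly this is an assumption, and the sample text "čaj" and the helper name are illustrative only.

```python
def round_trip(original_bytes: bytes, assumed_encoding: str) -> bytes:
    """Decode with the (wrong) assumed encoding, then save with it again."""
    # Assumption: undecodable bytes become U+FFFD instead of raising an error.
    text = original_bytes.decode(assumed_encoding, errors="replace")
    return text.encode(assumed_encoding)


on_disk = "čaj".encode("cp1250")      # č is the single byte 0xE8 in CP1250
saved = round_trip(on_disk, "utf-8")  # what ends up in the file after saving

print(on_disk.hex(" "))  # e8 61 6a
print(saved.hex(" "))    # ef bf bd 61 6a
```

Under these assumptions the original byte 0xE8 is gone after the save, so switching the file back to CP1250 afterwards cannot bring the č back.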


Should be: 
either:
 - unedited characters should stay as they are
or
 - a dialog explaining what happened (or a warning that it will happen)
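
A warning of this kind needs some way to notice the mismatch. One possible check, sketched here in Python rather than against any Eclipse API, is to decode the raw bytes strictly with the declared encoding and treat a failure as a mismatch. The function name and the sample text are hypothetical.

```python
def mismatched_encoding(raw: bytes, declared_encoding: str) -> bool:
    """Return True if raw is not valid in declared_encoding."""
    try:
        raw.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


legacy = "čaj".encode("cp1250")

print(mismatched_encoding(legacy, "utf-8"))   # True
print(mismatched_encoding(legacy, "cp1250"))  # False
```

A check like this cannot catch byte sequences that happen to be valid in the declared encoding, so it would reduce silent corruption rather than rule it out.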

Version:
Eclipse IDE for Enterprise Java Developers.

Version: 2019-03 (4.11.0)
Build id: 20190314-1200

Comment 1 Eclipse Genie

2021-05-05 00:28:58 EDT

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.

Comment 2 David Balažic

2021-05-05 06:11:56 EDT

This still happens in the current version:

Eclipse IDE for Enterprise Java and Web Developers (includes Incubating components)

Version: 2021-03 (4.19.0)
Build id: 20210312-0638


Here are revised steps to reproduce:

 - have a file in some project encoded in CP1250, containing non-ASCII characters such as č (create a text file, set its encoding to CP1250, enter some characters including non-ASCII ones such as čšž, save it, and close the editor)
 - in Eclipse, set the encoding for that file to UTF-8 (this often happens when working with legacy files while the Eclipse default is UTF-8), for example by right-clicking the file in the Project Explorer, selecting Properties in the context menu, and changing the encoding there to UTF-8
 - open the file in Eclipse (double-click it in Project Explorer to open a Text Editor)
 - make some change, such as adding a space character
 - save: Ctrl+S

Result:
the non-ASCII characters in the file are corrupted (each is replaced with some multi-byte code)

To check: in Eclipse, set the file encoding back to CP1250, then open it in Eclipse

Comment 3 Thomas Wolf

2021-05-05 07:27:55 EDT

I'd be surprised if anybody would do anything here. If the encoding is set to UTF-8 in Eclipse, the file will be read as UTF-8. Unless it's a special combination of non-ASCII characters that by chance is also valid UTF-8, I would expect the content to be garbled already when opening the file.

Just set the correct encoding for the file.
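
Such a coincidence is easy to hit. As a small Python illustration (not Eclipse code), the three characters čšž from the revised steps encode in CP1250 to the bytes E8 9A 9E, which also form one well-formed UTF-8 sequence:

```python
raw = "čšž".encode("cp1250")
print(raw.hex(" "))                     # e8 9a 9e

decoded = raw.decode("utf-8")           # strict decoding succeeds
print(len(decoded), hex(ord(decoded)))  # 1 0x869e

# Re-encoding as UTF-8 reproduces the same three bytes.
print(decoded.encode("utf-8") == raw)   # True
```

Read as UTF-8, the three letters appear as a single unrelated character (U+869E), yet a decode and re-encode cycle leaves these particular bytes unchanged.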

Applying the encoding specified in Eclipse only on save and only to changed characters would be very bad IMO. Consider your file encoded with CP1250 containing č, but with the Eclipse encoding for that file set to UTF-8. Now enter a š somewhere else and save. You'd end up with a file with mixed CP1250/UTF-8 encoding, with the č unchanged as CP1250 and the š stored as UTF-8.
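
The mixed-encoding outcome described here can be sketched in Python. The byte string below is built by hand to mimic that hypothetical save behavior; it is not produced by Eclipse.

```python
# č kept as its CP1250 byte, a space, then š written as UTF-8.
mixed = "č".encode("cp1250") + b" " + "š".encode("utf-8")
print(mixed.hex(" "))  # e8 20 c5 a1

# The result is not valid UTF-8 ...
try:
    mixed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False

# ... and reading it as CP1250 does not give back the intended text either.
as_cp1250 = mixed.decode("cp1250")
print(as_cp1250 == "č š")  # False
```

Neither encoding recovers both characters, so no single encoding setting can read such a file correctly.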