262025 – [spell checking] Store encoding in dictionary

Status:	ASSIGNED

Alias:	None

Product:	JDT
Classification:	Eclipse Project
Component:	Text
Version:	3.4.1
Hardware:	All (OS: All)

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	JDT-Text-Inbox
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-01-22 10:34 EST by Chris Simmons
Modified:	2009-01-23 04:23 EST
CC List:	1 user

See Also:

Attachments
Sample dictionary project, using UTF-8 encoded dictionary. (2.61 KB, application/zip) 2009-01-22 10:34 EST, Chris Simmons	no flags

Description Chris Simmons

2009-01-22 10:34:03 EST

Created attachment 123381
Sample dictionary project, using UTF-8 encoded dictionary.

Build ID: M20080911-1700

Steps To Reproduce:
1. Run Eclipse with the attached cut-down Portuguese dictionary, encoded using UTF-8.  It contributes a dictionary fragment to the JDT spell checker.  The key thing to note is that the dictionary contains non-ASCII characters.

2. Everything's fine if your spelling/platform encoding is UTF-8; note that "abóbada" is in the dictionary (add it to a Javadoc comment, say).

3. However, run with -Dfile.encoding=Cp1250 and it doesn't look so rosy: it has decoded the dictionary using the wrong encoding.  It suggests changing "abóbada" to "abĂłbada", presumably because that's what the Cp1250 mis-decoding results in.
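
The mis-decoding in step 3 can be reproduced outside Eclipse. What follows is only a minimal sketch, assuming the JRE provides the windows-1250 charset (the canonical Java name for Cp1250); the class and method names are made up for illustration and are not part of JDT:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Encode as UTF-8 (how the dictionary file is stored), then decode with
    // a different charset (how it is read when the configured encoding is wrong).
    static String misdecode(String word, Charset wrong) {
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: windows-1250 is available in this JRE.
        Charset cp1250 = Charset.forName("windows-1250");
        String word = "ab\u00F3bada"; // "abóbada", escaped so the source file encoding does not matter
        String garbled = misdecode(word, cp1250);
        // Print code points rather than characters to avoid depending on the console encoding.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The single "ó" is stored as the two UTF-8 bytes C3 B3, and Cp1250 maps each of those bytes to a character of its own, which is why the suggestion is one character longer than the original word.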


More information:
I think the dictionary encoding preference only really makes sense for user dictionaries as they stand.  In reality any dictionary resource has some fixed character encoding (in this case UTF-8) and you're not going to get very far using the wrong decoder.

Perhaps the dictionary should include the encoding in its locale as in

pt_PT.UTF-8.dictionary

or some such?  Linux appears to do something along these lines.
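
As a rough illustration of that naming idea, a loader could pull the charset out of the file name and fall back to the preference value when none is given. This is a sketch under assumptions: the `<locale>[.<charset>].dictionary` pattern is just the suggestion above, and `DictionaryName` and `encodingOf` are hypothetical names, not existing JDT API:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictionaryName {
    // Hypothetical scheme: <locale>[.<charset>].dictionary, e.g. pt_PT.UTF-8.dictionary
    private static final Pattern NAME =
        Pattern.compile("([A-Za-z]{2}(?:_[A-Za-z]{2})?)(?:\\.(.+))?\\.dictionary");

    /**
     * Returns the charset named in the file name, or the fallback
     * (for example the 'Spelling' preference value) if the name has none.
     * Charset.forName throws an unchecked exception for unknown charset names.
     */
    static Charset encodingOf(String fileName, Charset fallback) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dictionary name: " + fileName);
        }
        String charsetName = m.group(2);
        return charsetName == null ? fallback : Charset.forName(charsetName);
    }

    /** Returns the locale part of the file name, e.g. "pt_PT". */
    static String localeOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dictionary name: " + fileName);
        }
        return m.group(1);
    }
}
```

Existing names such as pt_PT.dictionary would keep working under this scheme, since they simply take the fallback encoding.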


With a full-size dictionary, which I'm not including, the decoder eventually fell over (it got to somewhere in the o's):

CoderResult.throwException() line: 261 [local variables unavailable]	
StreamDecoder.implRead(char[], int, int) line: 319	
StreamDecoder.read(char[], int, int) line: 158	
InputStreamReader.read(char[], int, int) line: 167	
BufferedReader.fill() line: 136	
BufferedReader.readLine(boolean) line: 299	
BufferedReader.readLine() line: 362	
LocaleSensitiveSpellDictionary(AbstractSpellDictionary).load(URL) line: 500	
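
The trace shows the decoder raising an exception part-way through load(URL). One way a loader could make that kind of failure easier to diagnose is to decode with an explicitly configured strict decoder and report which charset was assumed. A minimal sketch only; `StrictDictionaryReader` and `readWords` are hypothetical names and this is not how JDT's loader is written:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class StrictDictionaryReader {
    /** Reads one word per line, failing loudly if the bytes do not fit the charset. */
    static List<String> readWords(InputStream in, Charset charset) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line);
            }
        } catch (CharacterCodingException e) {
            // The count is approximate: the reader decodes a whole buffer at a time,
            // so the bad byte may lie well beyond the last word returned.
            throw new IOException("Dictionary is not valid " + charset.name()
                + " (failed after reading " + words.size() + " words)", e);
        }
        return words;
    }
}
```

Failing up front with the charset name in the message at least points at the encoding mismatch, rather than leaving a bare decoder exception somewhere in the middle of the file.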

I have absolutely no understanding of Portuguese so apologies if these random words happen to be offensive or some such :)

Comment 1 Chris Simmons

2009-01-22 10:38:32 EST

Looks like the "abĂłbada" got mangled by Bugzilla into "ab&#258;&#322;bada", I'm afraid.

Comment 2 Dani Megert

2009-01-23 03:59:31 EST

We currently only support the case where all dictionaries have the same encoding, and that encoding is the one given on the 'Spelling' preference page. Hence what you describe is not a bug but rather a user error. I see your point though.

I'm turning this bug into an enhancement to add the encoding information to the dictionary.

Comment 3 Chris Simmons

2009-01-23 04:23:37 EST

Thanks for looking into this :)