Bug 262025 - [spell checking] Store encoding in dictionary
Summary: [spell checking] Store encoding in dictionary
Status: ASSIGNED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Text
Version: 3.4.1
Hardware: All All
Importance: P3 enhancement
Target Milestone: ---
Assignee: JDT-Text-Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-22 10:34 EST by Chris Simmons CLA
Modified: 2009-01-23 04:23 EST

See Also:


Attachments
Sample dictionary project, using UTF-8 encoded dictionary. (2.61 KB, application/zip)
2009-01-22 10:34 EST, Chris Simmons CLA

Description Chris Simmons CLA 2009-01-22 10:34:03 EST
Created attachment 123381
Sample dictionary project, using UTF-8 encoded dictionary.

Build ID: M20080911-1700

Steps To Reproduce:
1. Run Eclipse with the attached cut-down Portuguese dictionary, encoded using UTF-8. It contributes a dictionary fragment to the JDT spell checker. The key thing to note is that the dictionary contains non-ASCII characters.

2. Everything's fine if your spelling/platform encoding is UTF-8; note that "abóbada" is in the dictionary (add it to a Javadoc comment, say).

3. However, run with -Dfile.encoding=Cp1250 and it doesn't look so rosy: it has decoded the dictionary using the wrong encoding. It suggests changing "abóbada" to "abĂłbada", presumably because that's what the Cp1250 mis-decoding results in.
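
The mis-decoding in step 3 can be reproduced outside Eclipse. The following is only a minimal sketch, not the JDT code: the class and method names are made up for illustration, and it assumes the JDK in use provides the windows-1250 (Cp1250) charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with a different charset. */
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // "ab\u00F3bada" is "abóbada"; in UTF-8 the ó becomes the two bytes 0xC3 0xB3.
        String word = "ab\u00F3bada";
        // windows-1250 maps 0xC3 to U+0102 (Ă) and 0xB3 to U+0142 (ł).
        String garbled = misdecode(word, Charset.forName("windows-1250"));
        System.out.println(garbled.equals("ab\u0102\u0142bada")); // prints "true"
    }
}
```

The single character ó turns into two characters, which matches the "abĂłbada" suggestion seen in the spell checker.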


More information:
I think the dictionary encoding preference only really makes sense for user dictionaries as they stand.  In reality any dictionary resource has some fixed character encoding (in this case UTF-8) and you're not going to get very far using the wrong decoder.

Perhaps the dictionary should include the encoding in its locale as in

pt_PT.UTF-8.dictionary

or some such?  Linux appears to do something along these lines.
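
As a rough illustration of that naming idea, a loader could pull the charset out of the file name and fall back to the preference value when none is present. This is a sketch under assumptions: the "locale.charset.dictionary" pattern is only the proposal above, and charsetFromName is a hypothetical helper, not an existing JDT API.

```java
import java.nio.charset.Charset;

public class DictionaryNameDemo {
    static final String SUFFIX = ".dictionary";

    /**
     * Returns the charset embedded in a name such as "pt_PT.UTF-8.dictionary",
     * or the fallback when the name carries no supported charset.
     */
    static Charset charsetFromName(String fileName, Charset fallback) {
        if (!fileName.endsWith(SUFFIX)) {
            return fallback;
        }
        String stem = fileName.substring(0, fileName.length() - SUFFIX.length());
        // Locale identifiers such as "pt_PT" contain no dot, so the first dot
        // (if any) separates the locale from the charset name.
        int dot = stem.indexOf('.');
        if (dot < 0) {
            return fallback; // e.g. "pt_PT.dictionary"
        }
        try {
            return Charset.forName(stem.substring(dot + 1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

With this, "pt_PT.UTF-8.dictionary" would resolve to UTF-8 regardless of -Dfile.encoding, while a plain "pt_PT.dictionary" would keep today's behaviour.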


With a full-size dictionary (not included here) the decoder eventually fell over; it got somewhere into the o's:

CoderResult.throwException() line: 261 [local variables unavailable]	
StreamDecoder.implRead(char[], int, int) line: 319	
StreamDecoder.read(char[], int, int) line: 158	
InputStreamReader.read(char[], int, int) line: 167	
BufferedReader.fill() line: 136	
BufferedReader.readLine(boolean) line: 299	
BufferedReader.readLine() line: 362	
LocaleSensitiveSpellDictionary(AbstractSpellDictionary).load(URL) line: 500	
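
For comparison, here is a sketch of reading a word list with an explicitly chosen charset and a strict decoder, so that a mismatch is reported as a CharacterCodingException instead of silently producing garbled words. It does not reproduce AbstractSpellDictionary.load; loadWords is a hypothetical stand-in, and the one-word-per-line format is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class StrictLoadDemo {
    /** Reads one word per line, failing on bytes that are invalid in the given charset. */
    static List<String> loadWords(InputStream in, Charset charset) throws IOException {
        // REPORT makes the decoder throw rather than substitute U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        }
        return words;
    }
}
```

Note that a strict decoder only catches byte sequences that are invalid in the chosen charset; a mis-decoding that happens to yield valid characters, like the "abóbada" case, still goes through unnoticed.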

I have absolutely no understanding of Portuguese so apologies if these random words happen to be offensive or some such :)
Comment 1 Chris Simmons CLA 2009-01-22 10:38:32 EST
Looks like the "abĂłbada" got mangled by Bugzilla, I'm afraid.
Comment 2 Dani Megert CLA 2009-01-23 03:59:31 EST
We currently only support the case where all dictionaries share the same encoding, and that encoding is the one set on the 'Spelling' preference page. Hence what you describe is not a bug but rather a user error. I see your point though.

I'm turning this bug into an enhancement to add the encoding information to the dictionary.
Comment 3 Chris Simmons CLA 2009-01-23 04:23:37 EST
Thanks for looking into this :)