Bug 191525 - [content type] Content type system should recognize text files with unknown extensions
Status: CLOSED WONTFIX
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources
Version: 3.3
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Platform-Resources-Inbox CLA
Whiteboard: stalebug
Reported: 2007-06-07 13:43 EDT by Stefan Xenos CLA
Modified: 2020-02-06 15:09 EST

Description Stefan Xenos CLA 2007-06-07 13:43:40 EDT
Create a .html file. Run the following snippet:

IContentType contentType = Platform.getContentTypeManager().findContentTypeFor(contents, fileName);

Observed:
- Eclipse doesn't have a specific type registered for HTML files by default, so the content type is null.

Expected:
- The content type points to the "text" content type. Although Eclipse doesn't specifically have an html content type, it does have a content type for text files and that type recognizer should have been able to detect a file that parses correctly using the default character encoding.


Why is this important?

- My RCP app uses the stream merger support from CompareUI. The stream mergers are indexed by content type. Most files the user encounters will be text files; however, very few of them will have the *.txt extension. This bug means that unless users first register their file extensions on the content types preference page, they will be unable to merge those files.
Comment 1 John Arthorne CLA 2007-06-07 14:12:57 EDT
I don't know any way of detecting whether a file "parses correctly" for a given character encoding. An encoding is just a way of mapping between characters and bytes, and many encodings will have a valid mapping for just about any sequence of bytes. I'm not sure we should treat all files as text if we can't otherwise determine their content type.

Having said that, when I create a file with some random extension in Eclipse, I am able to compare that file as text with local history or CVS repository. Perhaps the compare API has some way of defaulting to a text comparison when the content type is unknown?
Comment 2 Michael Valenta CLA 2007-06-07 14:35:17 EDT
The BinaryMergeViewer presents the user with the option of reverting to a text compare. However, Stefan is talking about a headless merge so that doesn't help here. There are really two issues here:

1) The Eclipse Platform should do a better job of determining whether a file with an unknown content type is text or not.

2) What should clients do when they encounter an unknown content type?

I guess one could argue that if we did case 1 properly, case 2 would never come up. I don't know enough about the content type determination algorithm to know if there is a reliable way to differentiate text files from binary files. Based on past discussions we've had, I believe the problem is harder than it appears (e.g. can you reliably distinguish a UTF-16 encoded file from a binary file given just the bytes?).

As for case 2, CVS handles this by associating a keyword mode with each individual file in the repository (in essence, the file is tagged as either binary or text). This way there is never any ambiguity as to whether a file is a candidate for auto-merge regardless of whether the content type is known or not. We do use the content types (and the Team file content type API) to determine the initial type (and we prompt the user if the type is unknown). We also provide an action to change the type if it was mistyped.

Comment 3 Stefan Xenos CLA 2007-06-07 15:59:05 EDT
We don't need to be able to detect all text files, and it shouldn't really matter if we occasionally detect a binary file as text. In these cases, the user can still resolve the ambiguity by going to the content types preference page (which is what they have to do now anyway).

All I want is to reduce the cases where the user needs to go to the preference page.

Here's a simple algorithm: Try parsing the file as UTF-8. If it works, treat it as a text file. This should correctly detect most text files (even those encoded in ISO-8859-1 or US-ASCII) and result in few false positives.
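
A minimal sketch of the "try parsing as UTF-8" idea, using only the JDK. The class and method names (Utf8Probe, decodesAsUtf8) are hypothetical and not part of any Eclipse API. It assumes the whole file is already in memory and configures the decoder to report malformed input instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Probe {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean decodesAsUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input; the default
        // behaviour of new String(bytes, UTF_8) is to substitute U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Two caveats with this sketch: ISO-8859-1 text passes only while it stays within the ASCII range (a lone 0xE9 byte followed by ASCII is malformed UTF-8), and a binary file made up entirely of bytes below 0x80, NUL bytes included, decodes without error.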

We could also make use of the OS's mime types, which would probably give us an even better answer... however, I assume that this would require additional API since the current APIs only take file contents and filename, and we'd presumably need to supply a fully-qualified path in order for the content type manager to look up the mime type.
Comment 4 John Arthorne CLA 2007-06-07 17:35:40 EDT
> Try parsing the file as UTF-8...

As far as I know, there is no such thing as a file that can't be parsed as UTF-8. While a UTF-8 file can optionally have a Byte Order Mark (BOM) to indicate the file type, this is not required (and Java for example doesn't write one automatically). So, in effect I think your algorithm would cause all unknown content types to be treated as text.  I don't think this is a good approach for a general-purpose content type API. If we don't know the content type, then it's more accurate to say "I don't know" than to say, "It's text". If a client wants to treat unknown content types as text, that option is available. 
Comment 5 Stefan Xenos CLA 2007-06-13 00:34:18 EDT
> As far as I know, there is no such thing as a file that can't be parsed 
> as UTF-8.

I thought most random files wouldn't parse as UTF-8, since there are some strict rules about what bytes can follow when the high bit is set... (I know I've seen files that Eclipse couldn't open as UTF-8) but perhaps I'm wrong, in which case the algorithm is flawed.

In that case, it should be possible to use the US-ASCII character set. Since ASCII always sets the high bit to 0, the probability of a random file falsely detecting as text would be 1/2^n, where n is the number of bytes.
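
As a rough sketch of the US-ASCII variant (hypothetical names, plain JDK, not Eclipse API): a file qualifies only if no byte has its high bit set. The 1/2^n estimate assumes every byte is uniformly random and independent of the others, which real binary formats are not, so treat it as an optimistic figure.

```java
// Hypothetical helper, for illustration only.
final class AsciiProbe {

    // True if every byte is in the 7-bit range 0x00..0x7F.
    static boolean isSevenBitAscii(byte[] bytes) {
        for (byte b : bytes) {
            // Java bytes are signed, so a set high bit shows up as a negative value.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // Chance that n uniformly random, independent bytes all have the high bit clear.
    static double randomFalsePositiveRate(int n) {
        return Math.pow(0.5, n);
    }
}
```

Note that this check still accepts NUL and other control bytes, so a stricter version might reject those as well.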


> If we don't know the content type, then it's more accurate to say 
> "I don't know" than to say, "It's text".

I agree that an algorithm that always said "it's text" for all unknown inputs would be flawed. I'm aiming at something that would be right 90% of the time or better, and would tend to err on the side of detecting text files as binary rather than detecting binary files as text.
Comment 6 Evan Hughes CLA 2007-07-10 17:14:44 EDT
The *nix "file" utility manages to guess MIME type fairly accurately by examining the file name and a portion of the file contents. I don't see why it wouldn't be possible to adapt the algorithm (and file of heuristics) to do the same thing. IIRC it also guesses the character encoding of text files.

Incidentally, there's at least one open-source project to do the same thing: http://sourceforge.net/projects/jmimemagic/ (licensed LGPL). It looks like it's under active development.
Comment 7 John Arthorne CLA 2007-07-11 10:47:23 EDT
Eclipse has a similar system but it uses a notion of "content type" rather than "mime type". Plug-ins can provide implementations of IContentDescriber that can determine content type and optionally charset for a given input stream. The problem here is whether we can distinguish between text and binary when the file extension doesn't give us any clues. The current implementation (TextContentDescriber) just looks for a UTF byte-order mark, and if it's not present, it says "I don't know".
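
To illustrate the byte-order-mark check described above, here is a standalone sketch. It is not the actual TextContentDescriber source; the class name BomSniffer and its method are hypothetical, and it assumes the caller has already read the first few bytes of the stream. A null return stands in for "I don't know".

```java
// Hypothetical helper, for illustration only; not Eclipse code.
final class BomSniffer {

    // Returns the charset name implied by a leading BOM, or null if none is present.
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        return null; // no BOM: the content gives no answer either way
    }
}
```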
Comment 8 Evan Hughes CLA 2007-07-11 11:17:27 EDT
> The problem here is whether we can distinguish between text and binary when the
> file extension doesn't give us any clues.

Given that the *nix "file" utility apparently can distinguish between text and binary based on content, it probably deserves examination. Perhaps someone with *nix on their desktop (Stefan?) could look into its heuristic and provide us with suggestions. 
Comment 9 John Arthorne CLA 2007-07-11 12:02:17 EDT
Please don't describe the *nix heuristics in this bug. Comments in this bugzilla are subject to Eclipse licensing terms which are likely not compatible.
Comment 10 Stefan Xenos CLA 2007-07-11 16:48:09 EDT
> Please don't describe the *nix heuristics in this bug.

Good point. Besides, I'm pretty sure we can come up with a respectable solution ourselves.
Comment 11 Mickael Istria CLA 2017-07-12 04:40:10 EDT
What if we try to read the file with the default encoding and check that all characters in it are "human-readable"?
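
One possible reading of that idea, as a sketch only: "human-readable" is not defined in this bug, so the code below assumes it means "decodes without replacement characters and contains no control characters other than tab, line feed and carriage return". The names ReadableProbe and looksReadable are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class ReadableProbe {

    // True if the bytes decode cleanly in the given charset and contain
    // no control characters apart from tab, LF and CR (an assumed definition).
    static boolean looksReadable(byte[] bytes, Charset charset) {
        // new String(...) substitutes U+FFFD for input it cannot decode.
        String text = new String(bytes, charset);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\uFFFD') {
                return false;
            }
            boolean whitespaceControl = c == '\t' || c == '\n' || c == '\r';
            if (Character.isISOControl(c) && !whitespaceControl) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass Charset.defaultCharset() to match the proposal. A text file that legitimately contains U+FFFD would be rejected by this sketch.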
Comment 12 Mickael Istria CLA 2017-07-12 07:32:16 EDT
With the recent addition of unknownEditorStrategy and, in the IDE, Marketplace discovery for unsupported files, this whole discussion may have become irrelevant.
Indeed, a user currently opening an HTML file with the IDE will see the IDE fail to resolve a content type and then an editor; MPC is then queried for a suitable plug-in to handle HTML. This is a good user story, as the user ends up with a dedicated rich tool for HTML in the IDE that makes them more productive.
If we go for automatic support of some content types as text, the user story becomes this: the user opens the HTML file, the IDE recognizes it as text, and the user is presented with the default Eclipse Text Editor, which is pretty weak and not satisfying for the user. This is a bad user story, as the IDE fails to deliver added value over a basic notepad.

Declaring a content type that can resolve to an editor implies that the editor is a good match for the target content type. In the case of HTML, users of an IDE don't perceive a plain text editor as a good match.

I'd be tempted to close the bug as WONTFIX, as the current state provides better user satisfaction than the proposals here.
Comment 13 Eclipse Genie CLA 2020-02-06 15:09:44 EST
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.