Bug 125639 - Odd that binary resources searched when, e.g. * used as file pattern
Summary: Odd that binary resources searched when, e.g. * used as file pattern
Status: CLOSED DUPLICATE of bug 475602
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Search (show other bugs)
Version: 3.2   Edit
Hardware: PC Windows XP
: P3 minor (vote)
Target Milestone: 3.2 RC1   Edit
Assignee: Martin Aeschlimann CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-29 03:00 EST by David Williams CLA
Modified: 2015-09-11 09:48 EDT (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Williams CLA 2006-01-29 03:00:21 EST
Guess it doesn't happen often, but I'm always surprised to see zips and jars sometimes pop up in hit list. I would think this would be a big performance drain. 
(to index them too?). I realize sometimes people do want to do a text search in a binary file (as a work around for missing a better search solution) but ... seems like there should be a check box to "include binary resources". 

At least, until there was binary "interpreters" that knew how to (quickly) interpret the binary file (zip vs. jar vs. gif vs. jpg, etc) and "transform" to correct form sutable for text searching.
Comment 1 Martin Aeschlimann CLA 2006-01-29 17:08:43 EST
Is there an easy way to know if a resource contains text?
Comment 2 David Williams CLA 2006-01-29 21:07:44 EST
Yes, assuming you are working with IResources ... the psuedoCode be similar to the following (there could be lots of other, similar possibilities, dpending on details. Technicall, all content types in the platform must be of type (or subtype) of text, binary, or null (undefined) .... I'd suggest including "undefined" in to the binary lump in this context, but .. would have to "survey" many workspaces to know for sure). So, following is conservative variation, including only text. Could also be use 'binary' and exclude binary (to get the unknwon/undefined types to be searched too). 



// could be static or other semi-global variable
IContentType textType = Platform.getContentTypeManager()
  .getContentType("org.eclipse.core.runtime.text");
  
IFile fileToSearchOrIndex = ...<getIFile>... ;



IContentDescription description = fileToSearchOrIndex.getContentDescription();
if (description != null && description.getContentType().isKindOf(textType)) 
  {
  // do search or index processing with IFile 
  }
    

Comment 3 David Williams CLA 2006-01-29 21:27:06 EST
Note, I've also opened bug 125663 as a generalized idea of using content types in search. If it was implemented, it might satisfy this bug too. 



I'm also adding Rafeal to CC list (hope you don't mind), as he's the gaurdian of all things content type, so might have some interest in effects on performance, etc. (and might anticipate bad side effects, as well as good ones for users?). 

[for example, if contentType providers did not implement correctly, or did not specify correctly in their manifest.mf files, this change might, if implemented as I desscribed above, might cause more contentType processing to occur at earlier points in time (than currently) so might cause plugins to be activated earlier (again, if contentType providers not implemented correctly). 

Comment 4 Dani Megert CLA 2006-01-30 03:54:21 EST
Be aware that the file pattern suggested in comment 2 assumes that all existing file types are covered/registered, which is normally not the case. This pattern would, e.g. force me to install a C++ plug-in in order to find textual matches in a .ccp file since there's no '.cpp' definition in standard Eclipse SDK. -1 from my side for this change.

But what the code can do for each file is to check whether there's a content type, if not search it and if yes, only search it if it is 'text'. You might want to use ITextFileBufferManager.isTextFileLocation(IPath, boolean).

Rafael, I would assume that in general IFile.getContentDescription().getContentType() didn't load a plug-in but what if the content type uses an IContentDescriber?
Comment 5 Rafael Chaves CLA 2006-01-30 09:33:22 EST
I agree on assuming that files with unknown content types to potentially be text.

Re: plug-in activation - content type providers that contribute their own content describers are expected to exclude the corresponding package from auto-activation. Anyway, even if they misbehave and don't do that, content describers are instantiated right away when the content type manager is started, so you should not be concerned about that.
Comment 6 David Williams CLA 2006-01-30 11:08:44 EST
Given that non-defined types are included in text searches the psuedo code would be similar to ... 


// could be static or other semi-global variable
IContentType binaryType = Platform.getContentTypeManager()
  .getContentType("org.eclipse.core.runtime.binary");

IFile fileToSearchOrIndex = ...<getIFile>... ;



IContentDescription description = fileToSearchOrIndex.getContentDescription();

if (description == null || !(description.getContentType().isKindOf(binaryType)) 
      {
      // do search or index processing with IFile 
      }


(I myself still have doubts about best way ... since if not included, users could included them selves via preferences (e.g. adding .cpp to "text" content type) .. but, agree .. I have not done enough survey's of other's workspaces to know exactly what impact would be ... so, for this bug, let's leave it as excluding the "known binary" types ... since I that is my main concern, since zips, jars, images, etc., can all be big, and, most of the time not desired to be indexed or searched). 

Comment 7 David Williams CLA 2006-01-30 11:17:49 EST
Oh, and wanted to remind interested parties of bug 106830. 
While 'search' does not modify the files, I'm wondering if its a user-freiendly thing to allow/imply users could modify (via subsequenet 'replace' operation) a file that had unknown (null) content type. If those happended to really be undeclared binary, we'd be allowing them to shoot themselves in the foot (e.g. if someone doing a wide spread update of "copyright date", or similar. )


Comment 8 Rafael Chaves CLA 2006-01-30 11:24:18 EST
Note that there is no base binary content type, it is the other way around. From IContentTypeManager.CT_TEXT spec:

"All text-based content types ought to be sub types of the content type identified by this string. This provides a simple way for detecting whether a content type is text-based: 

 IContentType text = Platform.getContentTypeManager().getContentType(IContentTypeManager.CT_TEXT);
 IContentType someType = ...;
 boolean isTextBased = someType.isKindOf(text);"
Comment 9 David Williams CLA 2006-01-30 12:23:27 EST
(In reply to comment #8)
> Note that there is no base binary content type.

Ooohhhh, well, thanks. There should be, though. Is 
org.eclipse.core.runtime.content.BinarySignatureDescriber
the only "binary" case provided by base? 

But, this makes me feel all the stronger that user's choice should be a little check box that says "text files only" (its inverse would be "all") and text files only would meean only "well defined text files" ... not also "unknown" types. 
I think this matches users expectations better, allows subsequent "replace" operation to work naturally as expected, and if users wanted more "text" types (e.g. for cpp) they could add to text content type themselves. 

Comment 10 Rafael Chaves CLA 2006-01-30 12:57:29 EST
No, I was just pointing out that the syntax for deciding whether a content type is text-based or not was different from what you posted (because there is no root binary content type to conmpare against, instead there is a text root content type - all others are assumed to be binary).

Re: BinarySignatureDescriber - that is an unrelated issue in my opinion. It is there only to help people that want to provide their own binary content types that have signatures at fixed locations (such as .class files' 0xCAFEBABE) by freeing them from having to provide their own and worry about avoiding auto-activation.
Comment 11 David Williams CLA 2006-01-30 13:18:01 EST
Ok. Thanks for clarifying .. so we're left with 

Text content type (or child of ~): text
any other content type not a child of text: binary
no content type at all (null): text or binary (undefined)

So, actually if there is no "zip" or "jar" contentTypes (I dont' think there is?) 
then including the undefined types in with potential text would 
not even help my original observations that prompted this bug report. 
Who knows, maybe its already implemented as stated, and we just need 
to introduce a generic binary content type :) 
Comment 12 Rafael Chaves CLA 2006-01-30 13:36:31 EST
You are right, the result will be the same unless you have the option to ignore unknown content types (BTW: bug 106832 requests content types for JAR files, which should be a specialization of ZIP files - should be provided as well).
Comment 13 Martin Aeschlimann CLA 2006-04-06 13:23:29 EDT
fixed > 20060406, thanks Markus for your help.

When searching for '*' we skip all files that are 'binary'.
A file is considered binary if it is not of content type 'TEXT' and its content contains a '/0' character.
To be precice, we read files in pages, and we only test the first page for '0''s, for performance reasons.

Let me know if you have a better heuristic.
So far I'm pretty excited about this change: Search with '*' is performing great now: I therefor changed the Text Search actions to always use '*' (before they used the last used file extensions)
Comment 14 Markus Schorn CLA 2006-04-07 08:30:35 EDT
To be precise, testing only the first 256K is not a performance win, because in case the file is not binary, for determining the length of the CharacterSequence, the complete file is read anyhow! I suggest testing the full file.
There is another option for improving performance: If we do the binary test within the FileCharSequenceProvider we can stop reading the file when hitting the first character that is zero, which usually will be well before the first 256K. The performance gain will not be great, though.
In case you care for one or the other change you may assign the work to me.
Comment 15 Martin Aeschlimann CLA 2006-04-07 08:45:30 EDT
Ok, agree. As long as the reg-ex implementation always does a getLength() first, the file is completly read anyways.
I'm still hoping that they improve this in some future implementation of regex.

For the second suggestion, we would have to measure this:
Reading a block of data is faster than reading one by one. So, maybe it would be worth doing a separate, unrelated pre-read to evaluate if a file is binary. This could be done with a smaller array, and could also ignore charset-decoding. Even if the information read would be lost, it would still bring the file into the filesystem cache and the second, full file read would be faster.

Yes, help is always welcome! 
Comment 16 David Williams CLA 2006-04-07 11:47:15 EDT
In my humble opinon, heurtically reading the file at all, looking for '0x00' should not be needed at all. Why is thig being done? Why is not ContentType sufficient? 

Could it be we need more 'text' types defined? (bug 106832) And jar and zip types? (I think they are now, right?) 

Besides being heuristic, my concern about the heuristic is that you are not achiving the savings you otherwise could be. Plus, you are picking this heuristic for *your* search ... what if there were others? What if there were others that were supposed to use or match your resutls? This then, sort of, becomes a "hidden API spec" of behavior that just seems overly ocmplicated to me. I think the problem you are trying to solve with the heuristic could be solved in other ways .... what problem are you trying to solve with the heuristc, exactly? 

Comment 17 Martin Aeschlimann CLA 2006-04-07 12:10:46 EDT
I also wished that all text files use context type 'text' or all binary files use something like 'binary'.
The chance to miss some text files and confuse the user with missing matches has to be avoided. We rather report too many matches that missing real ones.
So far the only binary content type I see is 'Java Class File'. Search doesn't know about Java, so we really need the platform to introduce the 'Binary' base type and we can get bigger improvements.

I'm not sure what you mean by 'only a heuristic for *your* search'. This is text search and text is well defined. 
Comment 18 David Williams CLA 2006-04-07 12:35:18 EDT
(In reply to comment #17)

> 
> I'm not sure what you mean by 'only a heuristic for *your* search'. This is
> text search and text is well defined. 
> 

Do you mean well defined for you? for eclipse? :) 
Or, maybe you mean its well defined with lots of quirks and possible prefrences settings? 

If you read something that explains the 'grep' command in detail (such as its 'man') you'll get a better sense of what I mean. They too (sometimes) use heuristics for some things, sometimes related to the 'binary' nature of a file, but they alsway have to allow some options to let a user specify more exactly how they want the search done. 

Comment 19 Markus Schorn CLA 2006-04-10 05:43:54 EDT
The chances that you can define content-types for all possible files are low. That is true for both binary and text files. Especially files that do not have a file-extension are a problem. If you look at the source code of the Mozilla project you find many examples for files with file-extension. I am listing only those where the extension starts with an 'a':
Text files: hppatch.adb, en-US.aff, makefiles.all, README.alpha, *.amiga, Makefile.am, *.api
Binary files: testembed.aps

When I am searching for some text within the project, I cannot figure out all possible extensions first, and get the content-types right. ==> I need to search files with unknown content-type.

At the same time it is a pain to get text-search matches within a binary file. The heuristics that excludes files containing '\000' helps with that. I agree that the concept of a binary file is unprecise. Although I'd consider a file containing a '\000' to be binary for sure, the heuristics may fail to detect some binary files.

Certainly I'd be good to detect as many binary files as possible by looking at the file-name. So if we can assume that any content type not deriving from IContentTypeManager.CT_TEXT is binary (see Comment 8) we should change the implementation such that those are not searched. Then you can avoid scanning through the binaries by defining content-types, although executables without a file-extension will remain to be a problem.
Comment 20 Dani Megert CLA 2015-09-11 09:48:16 EDT
.

*** This bug has been marked as a duplicate of bug 475602 ***