125639 – Odd that binary resources searched when, e.g. * used as file pattern

Bug 125639 - Odd that binary resources searched when, e.g. * used as file pattern

Summary: Odd that binary resources searched when, e.g. * used as file pattern

Status:	CLOSED DUPLICATE of bug 475602

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Search (show other bugs)
Version:	3.2
Hardware:	PC Windows XP

Importance:	P3 minor (vote)
Target Milestone:	3.2 RC1
Assignee:	Martin Aeschlimann
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-01-29 03:00 EST by David Williams
Modified:	2015-09-11 09:48 EDT (History)
CC List:	3 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description David Williams

2006-01-29 03:00:21 EST

Guess it doesn't happen often, but I'm always surprised to see zips and jars sometimes pop up in hit list. I would think this would be a big performance drain. 
(to index them too?). I realize sometimes people do want to do a text search in a binary file (as a work around for missing a better search solution) but ... seems like there should be a check box to "include binary resources". 

At least, until there was binary "interpreters" that knew how to (quickly) interpret the binary file (zip vs. jar vs. gif vs. jpg, etc) and "transform" to correct form sutable for text searching.

Comment 1 Martin Aeschlimann

2006-01-29 17:08:43 EST

Is there an easy way to know if a resource contains text?

Comment 2 David Williams

2006-01-29 21:07:44 EST

Yes, assuming you are working with IResources ... the psuedoCode be similar to the following (there could be lots of other, similar possibilities, dpending on details. Technicall, all content types in the platform must be of type (or subtype) of text, binary, or null (undefined) .... I'd suggest including "undefined" in to the binary lump in this context, but .. would have to "survey" many workspaces to know for sure). So, following is conservative variation, including only text. Could also be use 'binary' and exclude binary (to get the unknwon/undefined types to be searched too). 



// could be static or other semi-global variable
IContentType textType = Platform.getContentTypeManager()
  .getContentType("org.eclipse.core.runtime.text");
  
IFile fileToSearchOrIndex = ...<getIFile>... ;



IContentDescription description = fileToSearchOrIndex.getContentDescription();
if (description != null && description.getContentType().isKindOf(textType)) 
  {
  // do search or index processing with IFile 
  }

Comment 3 David Williams

2006-01-29 21:27:06 EST

Note, I've also opened bug 125663 as a generalized idea of using content types in search. If it was implemented, it might satisfy this bug too. 



I'm also adding Rafeal to CC list (hope you don't mind), as he's the gaurdian of all things content type, so might have some interest in effects on performance, etc. (and might anticipate bad side effects, as well as good ones for users?). 

[for example, if contentType providers did not implement correctly, or did not specify correctly in their manifest.mf files, this change might, if implemented as I desscribed above, might cause more contentType processing to occur at earlier points in time (than currently) so might cause plugins to be activated earlier (again, if contentType providers not implemented correctly).

Comment 4 Dani Megert

2006-01-30 03:54:21 EST

Be aware that the file pattern suggested in comment 2 assumes that all existing file types are covered/registered, which is normally not the case. This pattern would, e.g. force me to install a C++ plug-in in order to find textual matches in a .ccp file since there's no '.cpp' definition in standard Eclipse SDK. -1 from my side for this change.

But what the code can do for each file is to check whether there's a content type, if not search it and if yes, only search it if it is 'text'. You might want to use ITextFileBufferManager.isTextFileLocation(IPath, boolean).

Rafael, I would assume that in general IFile.getContentDescription().getContentType() didn't load a plug-in but what if the content type uses an IContentDescriber?

Comment 5 Rafael Chaves

2006-01-30 09:33:22 EST

I agree on assuming that files with unknown content types to potentially be text.

Re: plug-in activation - content type providers that contribute their own content describers are expected to exclude the corresponding package from auto-activation. Anyway, even if they misbehave and don't do that, content describers are instantiated right away when the content type manager is started, so you should not be concerned about that.

Comment 6 David Williams

2006-01-30 11:08:44 EST

Given that non-defined types are included in text searches the psuedo code would be similar to ... 


// could be static or other semi-global variable
IContentType binaryType = Platform.getContentTypeManager()
  .getContentType("org.eclipse.core.runtime.binary");

IFile fileToSearchOrIndex = ...<getIFile>... ;



IContentDescription description = fileToSearchOrIndex.getContentDescription();

if (description == null || !(description.getContentType().isKindOf(binaryType)) 
      {
      // do search or index processing with IFile 
      }


(I myself still have doubts about best way ... since if not included, users could included them selves via preferences (e.g. adding .cpp to "text" content type) .. but, agree .. I have not done enough survey's of other's workspaces to know exactly what impact would be ... so, for this bug, let's leave it as excluding the "known binary" types ... since I that is my main concern, since zips, jars, images, etc., can all be big, and, most of the time not desired to be indexed or searched).

Comment 7 David Williams

2006-01-30 11:17:49 EST

Oh, and wanted to remind interested parties of bug 106830. 
While 'search' does not modify the files, I'm wondering if its a user-freiendly thing to allow/imply users could modify (via subsequenet 'replace' operation) a file that had unknown (null) content type. If those happended to really be undeclared binary, we'd be allowing them to shoot themselves in the foot (e.g. if someone doing a wide spread update of "copyright date", or similar. )

Comment 8 Rafael Chaves

2006-01-30 11:24:18 EST

Note that there is no base binary content type, it is the other way around. From IContentTypeManager.CT_TEXT spec:

"All text-based content types ought to be sub types of the content type identified by this string. This provides a simple way for detecting whether a content type is text-based: 

 IContentType text = Platform.getContentTypeManager().getContentType(IContentTypeManager.CT_TEXT);
 IContentType someType = ...;
 boolean isTextBased = someType.isKindOf(text);"

Comment 9 David Williams

2006-01-30 12:23:27 EST

(In reply to comment #8)
> Note that there is no base binary content type.

Ooohhhh, well, thanks. There should be, though. Is 
org.eclipse.core.runtime.content.BinarySignatureDescriber
the only "binary" case provided by base? 

But, this makes me feel all the stronger that user's choice should be a little check box that says "text files only" (its inverse would be "all") and text files only would meean only "well defined text files" ... not also "unknown" types. 
I think this matches users expectations better, allows subsequent "replace" operation to work naturally as expected, and if users wanted more "text" types (e.g. for cpp) they could add to text content type themselves.

Comment 10 Rafael Chaves

2006-01-30 12:57:29 EST

No, I was just pointing out that the syntax for deciding whether a content type is text-based or not was different from what you posted (because there is no root binary content type to conmpare against, instead there is a text root content type - all others are assumed to be binary).

Re: BinarySignatureDescriber - that is an unrelated issue in my opinion. It is there only to help people that want to provide their own binary content types that have signatures at fixed locations (such as .class files' 0xCAFEBABE) by freeing them from having to provide their own and worry about avoiding auto-activation.

Comment 11 David Williams

2006-01-30 13:18:01 EST

Ok. Thanks for clarifying .. so we're left with 

Text content type (or child of ~): text
any other content type not a child of text: binary
no content type at all (null): text or binary (undefined)

So, actually if there is no "zip" or "jar" contentTypes (I dont' think there is?) 
then including the undefined types in with potential text would 
not even help my original observations that prompted this bug report. 
Who knows, maybe its already implemented as stated, and we just need 
to introduce a generic binary content type :)

Comment 12 Rafael Chaves

2006-01-30 13:36:31 EST

You are right, the result will be the same unless you have the option to ignore unknown content types (BTW: bug 106832 requests content types for JAR files, which should be a specialization of ZIP files - should be provided as well).

Comment 13 Martin Aeschlimann

2006-04-06 13:23:29 EDT

fixed > 20060406, thanks Markus for your help.

When searching for '*' we skip all files that are 'binary'.
A file is considered binary if it is not of content type 'TEXT' and its content contains a '/0' character.
To be precice, we read files in pages, and we only test the first page for '0''s, for performance reasons.

Let me know if you have a better heuristic.
So far I'm pretty excited about this change: Search with '*' is performing great now: I therefor changed the Text Search actions to always use '*' (before they used the last used file extensions)

Comment 14 Markus Schorn

2006-04-07 08:30:35 EDT

To be precise, testing only the first 256K is not a performance win, because in case the file is not binary, for determining the length of the CharacterSequence, the complete file is read anyhow! I suggest testing the full file.
There is another option for improving performance: If we do the binary test within the FileCharSequenceProvider we can stop reading the file when hitting the first character that is zero, which usually will be well before the first 256K. The performance gain will not be great, though.
In case you care for one or the other change you may assign the work to me.

Comment 15 Martin Aeschlimann

2006-04-07 08:45:30 EDT

Ok, agree. As long as the reg-ex implementation always does a getLength() first, the file is completly read anyways.
I'm still hoping that they improve this in some future implementation of regex.

For the second suggestion, we would have to measure this:
Reading a block of data is faster than reading one by one. So, maybe it would be worth doing a separate, unrelated pre-read to evaluate if a file is binary. This could be done with a smaller array, and could also ignore charset-decoding. Even if the information read would be lost, it would still bring the file into the filesystem cache and the second, full file read would be faster.

Yes, help is always welcome!

Comment 16 David Williams

2006-04-07 11:47:15 EDT

In my humble opinon, heurtically reading the file at all, looking for '0x00' should not be needed at all. Why is thig being done? Why is not ContentType sufficient? 

Could it be we need more 'text' types defined? (bug 106832) And jar and zip types? (I think they are now, right?) 

Besides being heuristic, my concern about the heuristic is that you are not achiving the savings you otherwise could be. Plus, you are picking this heuristic for *your* search ... what if there were others? What if there were others that were supposed to use or match your resutls? This then, sort of, becomes a "hidden API spec" of behavior that just seems overly ocmplicated to me. I think the problem you are trying to solve with the heuristic could be solved in other ways .... what problem are you trying to solve with the heuristc, exactly?

Comment 17 Martin Aeschlimann

2006-04-07 12:10:46 EDT

I also wished that all text files use context type 'text' or all binary files use something like 'binary'.
The chance to miss some text files and confuse the user with missing matches has to be avoided. We rather report too many matches that missing real ones.
So far the only binary content type I see is 'Java Class File'. Search doesn't know about Java, so we really need the platform to introduce the 'Binary' base type and we can get bigger improvements.

I'm not sure what you mean by 'only a heuristic for *your* search'. This is text search and text is well defined.

Comment 18 David Williams

2006-04-07 12:35:18 EDT

(In reply to comment #17)

> 
> I'm not sure what you mean by 'only a heuristic for *your* search'. This is
> text search and text is well defined. 
> 

Do you mean well defined for you? for eclipse? :) 
Or, maybe you mean its well defined with lots of quirks and possible prefrences settings? 

If you read something that explains the 'grep' command in detail (such as its 'man') you'll get a better sense of what I mean. They too (sometimes) use heuristics for some things, sometimes related to the 'binary' nature of a file, but they alsway have to allow some options to let a user specify more exactly how they want the search done.

Comment 19 Markus Schorn

2006-04-10 05:43:54 EDT

The chances that you can define content-types for all possible files are low. That is true for both binary and text files. Especially files that do not have a file-extension are a problem. If you look at the source code of the Mozilla project you find many examples for files with file-extension. I am listing only those where the extension starts with an 'a':
Text files: hppatch.adb, en-US.aff, makefiles.all, README.alpha, *.amiga, Makefile.am, *.api
Binary files: testembed.aps

When I am searching for some text within the project, I cannot figure out all possible extensions first, and get the content-types right. ==> I need to search files with unknown content-type.

At the same time it is a pain to get text-search matches within a binary file. The heuristics that excludes files containing '\000' helps with that. I agree that the concept of a binary file is unprecise. Although I'd consider a file containing a '\000' to be binary for sure, the heuristics may fail to detect some binary files.

Certainly I'd be good to detect as many binary files as possible by looking at the file-name. So if we can assume that any content type not deriving from IContentTypeManager.CT_TEXT is binary (see Comment 8) we should change the implementation such that those are not searched. Then you can avoid scanning through the binaries by defining content-types, although executables without a file-extension will remain to be a problem.

Comment 20 Dani Megert

2015-09-11 09:48:16 EDT

.

*** This bug has been marked as a duplicate of bug 475602 ***