182337 – [Content Type] ContentTypeMatcher does too much work

Bug 182337 - [Content Type] ContentTypeMatcher does too much work

Summary: [Content Type] ContentTypeMatcher does too much work

Status:	VERIFIED FIXED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Resources (show other bugs)
Version:	3.3
Hardware:	PC Windows XP

Importance:	P3 major (vote)
Target Milestone:	3.6 M3
Assignee:	Pawel Pogorzelski
QA Contact:

URL:
Whiteboard:
Keywords:	performance

Duplicates (1):	270789 (view as bug list)
Depends on:
Blocks:

Reported:	2007-04-13 11:08 EDT by Dani Megert
Modified:	2010-02-09 08:54 EST (History)
CC List:	11 users (show)

See Also:

Attachments
Patch_v01 (27.19 KB, patch) 2009-09-22 11:21 EDT, Pawel Pogorzelski	no flags	Details \| Diff
Patch_v02 (25.52 KB, patch) 2009-09-24 11:53 EDT, Pawel Pogorzelski	no flags	Details \| Diff
Patch_v03A (25.87 KB, patch) 2009-09-25 09:00 EDT, Pawel Pogorzelski	no flags	Details \| Diff
Patch_v03B (3.67 KB, patch) 2009-09-25 09:01 EDT, Pawel Pogorzelski	no flags	Details \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dani Megert

2007-04-13 11:08:46 EDT

While investigating the performance regression in JDT Text test:
  performance.ContentTypeTest#testPluginXMLDirty()

it turned out that even though we pass in the exact file name ("plugin.xml") to 
IContentTypeMatcher.getDescriptionFor(...) the code tries all registered xml content types that are registered for "xml" extension as well, even though the pluginManifest type is registered with 'high' priority for "plugin.xml":
     <content-type 
      		id="pluginManifest" 
      		name="%pluginManifestName" 
    		base-type="org.eclipse.core.runtime.xml"
		priority="high"				
    		file-names="plugin.xml">
    	<describer class="org.eclipse.core.runtime.content.XMLRootElementContentDescriber">
    		<parameter name="element" value="plugin"/>
	</describer>

Is this intended?

Now it goes on: since most registered XML content types use a XMLRootElementContentDescriber this becomes quite expensive: the given contents is read-in by each content describer. In addition it seems that all describers are queried even if a match has been found.

While this code was the same in 3.2 we now get a regression all over the place when it comes to XML files because two new content types got added during 3.3 that are sub types of xml and registered for file extension "xml" with a XMLRootElementContentDescriber. E.g. all operations that need the charset of a file (IFile.getCharset()) take longer.

Comment 1 Dani Megert

2007-04-13 11:23:05 EDT

This is something we should investigate for 3.3 since it affects all operations that read xml files which are registered with its full file name (e.g. plugin.xml). For instance PDE builder could take a penalty here.

Comment 2 Dani Megert

2007-04-13 11:25:26 EDT

Note to Wassim: the PDE editor opening perf test is probably not affected by this as I guess it won't query the charset that often.

The two new charsets with the describer are:
  org.eclipse.pde.simpleCheatSheet
  org.eclipse.pde.compositeCheatSheet

Comment 3 Dani Megert

2007-04-13 11:34:14 EDT

Note that the describer does not have a weight, just VALID, INVALID and INDETERMINATE. So, I'd expect to first check the plugin.xml describer and if this is valid do no further work.

Comment 4 Dani Megert

2007-04-17 10:14:06 EDT

At least the encoding is
- cached until next change to the file
- file is read through an ILazySource which only opens the file once

Comment 5 John Arthorne

2007-04-17 11:16:54 EDT

Oleg, would you have a chance to look at this if you are still doing perfomrance work?  Rafael was the only maintainer of this code, so I don't know much about it. However, from a quick look at the code (org.eclipse.core.contenttype), I'm not sure if this can be optimized easily because there is an ISelectionPolicy API that clients may implement.  First it computes the set of possible content types, and then the policy gets to pick which one is best.

Comment 6 Dani Megert

2007-04-17 11:27:39 EDT

The matcher is pretty vague in telling when the ISelectionPolicy is actually used. We could first check the "perfect" content type(s) and only if the policy doesn't want any of them go through the less relevant candidates.

Comment 7 Pascal Rapicault

2007-04-17 14:51:00 EDT

How much slower (in ms) are we? 
Is there a visible impact on the end user?

Comment 8 Dani Megert

2007-04-18 03:49:57 EDT

You can look at the test case for numbers (it calls getDescription 1000 times and this is about 800ms [45%] slower). The problem surfaces when many xml files get touched and hence the getCharset() is recomputed for them.

And of course the problem increases for products like WTP who probably also add several content types with a content describer.

I think the only thing we might be able to improve is the perfect (file name) match: I really don't see why we go for all *.xml content types if it perfectly matches to 'plugin.xml'.

Comment 9 Oleg Besedin

2007-04-25 11:38:31 EDT

John (comment 5) is so right it is scary :-).

Anybody who knows this code is welcome to correct me, but from what I see the concept of "perfect" match is worked gradually and is not apparent until all describers have been queried.

- The order in which describers are queried is semi-random. Priority does not play role in it.

- Priority is one of the elements on which results are sorted; depth overrides priority in ordering of results.

- The query utilized in those tests uses a non-null ISelectionPolicy that favors "associated content types". 

Due to the last point, we need to collect all matches and pass it to that selection policy. The irony is that plugin.xml files are always associated (well, always in the current version of Eclipse), but content type has no knowledge of it. 

Another point to consider is that XMLRootElementContentDescriber have to open the file to match the top tag. (In case of manifest files it is not enough to have the name "plugin.xml", but the top tag should be named "plugin".) So the file will be opened and portion of it will be read even if we have only one content describer.  

As a point of interest, from the numbers you have, the extra processing time is 0.8ms/call. The access time for a typical hard drive is somewhere on the order of 1ms to 5ms. Meaning that extra reads in this test [for the most part] happen from the cache.

Comment 10 Dani Megert

2007-04-25 12:13:18 EDT

I agree that it's scary, so maybe we just have to live with this.

Comment 11 Oleg Besedin

2007-04-25 17:01:21 EDT

Any other opinions?

I would to keep this bug open but move it into post-3.3:

- no simple solution seems to exist
- the slowdown is 0.8ms per call - hardly noticeable to a user

but:
- this highlights a potential scalability problem with content types mechanism

I propose to leave this bug open and remove the milestone. In this case when there is a practical need or time available this bug can be revisited.

Comment 12 John Arthorne

2007-04-25 17:44:14 EDT

Agree. This may become more of an issue in the future, or in large distributions, if the number of content type describers grows.

Comment 13 Randall Theobald

2008-11-11 08:38:16 EST

I am a performance analyst for a large adopter (a couple layers on top of WTP). We are trying to finish up our release that sits on Eclipse 3.4.1 and are definitely seeing this as a performance issue (e.g. see bug 247689). Perhaps I can add a bit to this discussion and maybe generate some new ideas.

The bug is scary, I think, because this whole API of content describers is open to implementators to implement, but is not scalable to the number of implementors. Each implementor could potentially parse the data in a different way. On top of that problem, clients often use the describer API as a router for resource changes--where the destination will probably access the resource again (another parse) and take some action.

The only way I see to improve the situation is to modify this whole API such that the majority of cases require no actual opening or parsing of the file. Obviously, this would be a long-term approach. Shorter-term approaches involves incremental fixes such as bug 247689, and investigation into reusing parsers in XMLRootHandler (for our workloads, the majority of time in the content describer APIs are simply creating new SAX parsers in XMLRootHandler)--which I haven't opened a bug for (yet).

Comment 14 Randall Theobald

2008-11-11 11:54:49 EST

Another possible improvement would be as follows.

Often, clients use the content description API to see if a file matches a certain content type. Unfortunately, they have to GET the content description for the file before this comparison can be made (even when a simple filename comparison could suffice). Thus, introducing an API that could quickly query whether a given file matches a given content type, where the algorithm only does the bare minimum to return the answer (e.g. if a filename comparison results show a non-match, don't read in the file using the content describers, etc.), would be beneficial (once clients convert to using this API, of course).

Comment 15 Dani Megert

2009-04-01 10:24:58 EDT

*** Bug 270789 has been marked as a duplicate of this bug. ***

Comment 16 Szymon Brandys

2009-04-01 12:08:53 EDT

(In reply to comment #14)
Randall, are you still willing to help here? I think that this would be a good candidate to investigate and fix during 3.6.

Comment 17 Randall Theobald

2009-04-01 12:20:25 EDT

I can help if needed. However, I'm not currently in a position to proactively research/push the issue.

Comment 18 Szymon Brandys

2009-04-22 08:32:22 EDT

The performance was slightly improved as stated in bug 247689. Investigating further fixes as suggested in comment 13 and comment 14.

Comment 19 Szymon Brandys

2009-04-29 09:35:35 EDT

It is a good candidate for 3.6 M1 or M2.

Comment 20 Szymon Brandys

2009-09-16 04:49:32 EDT

Pawel started looking at it during M2 and hopefully he'll provide a fix during M3.

Comment 21 Pawel Pogorzelski

2009-09-22 11:21:55 EDT

Created attachment 147787 [details]
Patch_v01

I prepared a POC patch for this particular problem. The proposal is not general and improves only XMLContentDescriber, XMLRootElementContentDescriber and XMLRootElementContentDescriber2 classes.

It works by passing a properties map from ContentTypeCatalog.internalFindContentTypesFor() downstream to content describers. Each describer checks if the map already contains properties it needs and if that's true it doesn't parse the content.

The fix won't introduce new API nor brake any. In my opinion solving more general problem takes decoupling the parsing step from parser's output interpretation. Right now each describer does both what is painful in terms of performance.

After applying the patch ContentTypeTest.testPluginXMLDirty() executes about 65% faster.

Any comments are welcome. However keep in mind it's merely a POC (a rough one).

Comment 22 Szymon Brandys

2009-09-23 05:41:48 EDT

I like the approach, however we should be sure that all tests pass. The patch breaks 9 content type tests.

Other comments are:
- fillContentProperties(InputSource input, Map properties) methods in the XMLRootElement describers could be replaced with one method 
- the same thing about new static constants (i.e. DTD, ELEMENT etc.)
- two fillContentProperties methods in XMLContentDescriber can be probably merged

Yet another thing is to raise a bug for more general solution we discussed recently.

Comment 23 Pawel Pogorzelski

2009-09-24 11:53:12 EDT

Created attachment 148026 [details]
Patch_v02

(In reply to comment #22)
> I like the approach, however we should be sure that all tests pass. The patch
> breaks 9 content type tests.

I refined the patch and now all the content type tests pass.

> Other comments are:
> - fillContentProperties(InputSource input, Map properties) methods in the
> XMLRootElement describers could be replaced with one method 
> - the same thing about new static constants (i.e. DTD, ELEMENT etc.)

Done.

> - two fillContentProperties methods in XMLContentDescriber can be probably
> merged

I think the current approach is more appropriate.

> Yet another thing is to raise a bug for more general solution we discussed
> recently.

Done, it's bug 290431.

Comment 24 Dani Megert

2009-09-24 11:58:52 EDT

>After applying the patch ContentTypeTest.testPluginXMLDirty() executes about
>65% faster.
Nice!

Comment 25 Pawel Pogorzelski

2009-09-25 09:00:08 EDT

Created attachment 148104 [details]
Patch_v03A

I've added @noreference to methods on XMLRootElementContentDescriber and XMLRootElementContentDescriber2.

Comment 26 Pawel Pogorzelski

2009-09-25 09:01:11 EDT

Created attachment 148105 [details]
Patch_v03B

Test.

Comment 27 Szymon Brandys

2009-09-28 08:52:38 EDT

Both Patch_v03A and Patch_v03B released to HEAD.

Comment 28 Pawel Pogorzelski

2009-10-05 09:11:44 EDT

Verified based on I20090929-0800 performance tests results.