84224 – Need advice for finding duplicate Types

Bug 84224 - Need advice for finding duplicate Types

Summary: Need advice for finding duplicate Types

Status:	VERIFIED FIXED

Alias:	None

Product:	JDT
Classification:	Eclipse Project
Component:	Core (show other bugs)
Version:	3.1
Hardware:	PC Windows XP

Importance:	P3 normal (vote)
Target Milestone:	3.1 M7
Assignee:	Kent Johnson
QA Contact:

URL:
Whiteboard:
Keywords:	performance

Depends on:
Blocks:

Reported:	2005-02-02 09:39 EST by Dirk Baeumer
Modified:	2005-05-11 09:14 EDT (History)
CC List:	2 users (show)

See Also:

Attachments
the algorithm in code (2.57 KB, text/plain) 2005-02-02 11:31 EST, Martin Aeschlimann	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dirk Baeumer

2005-02-02 09:39:15 EST

To get rid of the all types cache we have done several performance measurements
how to best calculate the number of duplicate types names (e.g. same type name,
but in different package). This information is essential for the organize import
actions and the import rewriter.

In a workspace containing all SDK plug-ins there are 1798!! duplicate type names:

The different searches we tested.

- searching for all types in a given scope. For example all types visible in 
  org.eclipse.jdt.ui. This reports 20671 matches and takes between 1.31 s and   
  1.86 s without us doing any analysis to find duplicate types.

- we search for all type names in a given type-container. A type container is
  either a package or a type with it package qualification if it is a nested
  type (e.g. pack.Type.InnerType the type container is pack.Type). Such a search
  doesn't exist yet, but we simulated it with a package as a type container. 
  The result for the package "org.eclipse.jdt.internal.ui.text" is 89 types and   
  takes between 220 ms and 250 ms.

  For the organize import we would search for all type containers which have
  a star import (e.g. end with '*'). Doing this in a loop we will spend the same 
  time as for the all types search when searching for 4 - 5 type containers. If
  clients use star imports they will easily have 4 - 5 star imports.

- We search for duplicate types for all types that are used unqualified in the
  source code to see if we have more than one match. For Assert in the scope
  of org.eclipse.jdt.ui we get 13 matches. This takes between 140 and 190 ms.


The current results aren't in a range where we can convert the import structure 
to use the search engine without having a noticable performance degradation. So
the question is:

- is there a better search/algorithm we can use ?
- can we add a special query to the search engine which than can be optimized ?
  For example a search query that reports all types in an array of type 
  containers so that we only scan the index once.

Comment 1 Kent Johnson

2005-02-02 11:07:55 EST

"We search for duplicate types for all types that are used unqualified in the 
source code to see if we have more than one match."

Why? You are not going to remove every single type import, are you?

From the bindings you should know which types are resolved to be from which 
packages & as a result from which existing on demand imports (if any).

I can see us adding the multi-declaration query, but I'm wondering how often 
you need to call it.

Also on the performance side, do you create a DOM AST for every file? And 
change/save every file? If these 2 operations take 2 seconds on average then 
does it matter if the search takes 200ms vs. 500ms?

Comment 2 Martin Aeschlimann

2005-02-02 11:09:37 EST

I updated the import rewriter to be able to switch between all type cache (of
all types in the workspace) and use of type search in the project.

I observed the same performance numbers as Dirk in comment 0.
On a real world example that means:

Organize imports on all (89) types of org.eclipse.compare (total 16497 types on
class path)
with all types cache (empty on startup): 14 s (initial search 4 s, then av. 0.15 s)
with project search: 123 s (av. 1.4 s per search in project)

Comment 3 Martin Aeschlimann

2005-02-02 11:24:57 EST

The algorithm is:
Given a set of type container names that will be star imported { pack1, pack2 }
and all the simple type names that are in these containers and are referenced in
the source code {A1 (from pack1), A2 (from pack1...), B1, B1 }.
Iterate over all types in the project and find if there is another type that has
one of the simple type (e.g. A1} name AND a container in our set but which is
not the one we reference { e.g. pack2.A1 }

I could give you the set of container names and/or the set of type names. But I
wonder if that would be much faster than having me looking at all types myself.
There is no real work done in the result call back, no element creation.

The only solution I see that there is something kind of cache of all types in
the project. I tought code assist uses something like that?

In my example in comment #2, there is an AST build for each of the files (89
source files) and the file is changed/saved every time. You see that the search
can't compare with the time the AST build takes.

Comment 4 Martin Aeschlimann

2005-02-02 11:31:11 EST

Created attachment 17632 [details]
the algorithm in code

Comment 5 Kent Johnson

2005-02-02 11:52:27 EST

"Iterate over all types in the project and find if there is another type that 
has one of the simple type"

Why?

It seems to me that you only need to search for collisions when you remove a 
single type import (because you added an on-demand import to cover it). So for 
each single type import being removed, you must check that it does not exist 
in any of the other on-demand imported packages.

I see no reason to ever ask for every type defined in a package or project.

Comment 6 Martin Aeschlimann

2005-02-02 12:14:28 EST

To find if there is another type with the same simple name I currently iterate
over all the types I got from the search engine (once), unless I want to do a
search for each type.
This is needed for all referenced types in the (new) on-demand package. That's
mostly more than one. It even has to happen for all type references in all other
(unchanged) on-demand packages.  
I would be very surprised if several searches are faster than one.
But I can implement that if you think its worth it.

Comment 7 Kent Johnson

2005-02-02 12:23:37 EST

I definitely don't expect it to be faster (now).

But if you could get to the point that you're asking for declarations of:
- all referenced types in the new on-demand package
- all referenced types in the other unchanged on-demand packages

AND the scope is limited to these on-demand packages and not the workspace or 
the project...

Then I think we can combine these type declaration queries into a single query 
that will have decent enough performance.

Comment 8 Martin Aeschlimann

2005-02-02 12:31:34 EST

The problem with the scope is that it has to be constructed out of IJavaElements.
I don't have the Java elements for the package and the types (in the case of
type containers). There's potentially more than one IPackageFragment for the
same package (e.g. in the multiple source folder scenario).

Comment 9 Kent Johnson

2005-02-02 12:40:17 EST

OK - well then maybe we need to define a multi-searchAllTypeNames that lets 
you pass in mutiple simple type names AND multiple qualified package names.

The scope would be the project.

The answer would be every type declaration whose simple type name is in the 
simple name set & whose package is the qualified set.

Something like:

searchAllTypeNames(
  final char[][] packageNames, 
  final char[][] typeNames,
  final int matchRule, 
  int searchFor, 
  IJavaSearchScope scope, 
  final ITypeNameRequestor nameRequestor,
  int waitingPolicy,
  IProgressMonitor progressMonitor)

Comment 10 Martin Aeschlimann

2005-02-02 12:48:06 EST

Great, if with 'package name' you mean 'type container name'.

I'm wondering, can the search engine really speed up knowing all that? Or would
it just iterate over all types as well and do the same checks I just did?
The would just save the call backs and they can't be so expensive.

Comment 11 Kent Johnson

2005-02-02 13:15:30 EST

It will do it as a single query so each index file is only opened once.

So assuming the time is ok for a project search of a single type in a 
qualified package/container...  then the combined time for several types in 
several containers should be a bit more but not much if its still a single 
index pass.

We would like to limit the number of index files further but we'll look into 
limiting the scope.

At the very least, say there are 3 containers but 15 type names: its likely 
worth it to see if the containers exist in the index file before checking for 
each type name.

Comment 12 Kent Johnson

2005-03-15 14:43:48 EST

So how is this?

For organize imports, we add a call to SearchEngine:

void searchAllTypeNames(
	final char[][] qualifiedNames, 
	final char[][] typeNames,
	final TypeNameRequestor nameRequestor,
	int waitingPolicy,
	IProgressMonitor progressMonitor)

It performs a case sensitive exact match to find all types whose name is 
included in typeNames & whose packageName + enclosingTypeName is included in 
qualifiedNames.

It answers the matching types using:

TypeNameRequestor.acceptType(modifiers, packageName, simpleTypeName, 
enclosingTypeNames, path)



For the open type dialog, we add a couple calls to SearchEngine or a new 
object such as GlobalTypesCache:

void searchAllTypeNames(
	final char[] typeName,
	int numberOfResultsToValidate,
	final TypeNameRequestor nameRequestor,
	int waitingPolicy,
	IProgressMonitor progressMonitor)

It performs a case insensitive pattern match to find all types whose name 
matches typeName.

It will call with first numberOfResultsToValidate matching types (sorted by 
typeName):

TypeNameRequestor.acceptTypeName(modifiers, packageName, simpleTypeName, 
enclosingTypeNames, resourcePath)

with remaining matching types, it will call:

TypeNameRequestor.acceptPossibleTypeName(modifiers, packageName, 
simpleTypeName, enclosingTypeNames, indexPath)

To validate that a possible matching type name still exists, you would need to 
find its resourcePath by calling:

String locateType(
	char[] typeName,
	char[] packageName,
	char[][] enclosingTypeNames,
	String indexPath)

The resulting path string would be null if the type can no longer be located.

Comment 13 Kent Johnson

2005-03-15 14:46:17 EST

Actually change that, the 2 methods on TypeNameRequestor would be the existing 
method:

TypeNameRequestor.acceptType(modifiers, packageName, simpleTypeName, 
enclosingTypeNames, resourcePath)

& a new method:

TypeNameRequestor.acceptPossibleType(modifiers, packageName, 
simpleTypeName, enclosingTypeNames, indexPath)

Comment 14 Dirk Baeumer

2005-03-16 09:18:41 EST

Kent, 

up to now I only looked into getting the open type dialog implemented on top of
search. However, we expose some of it functionality as public API. Furthermore
we know more about the pattern (for example if it is exact or a pattern
containing wild cards). So I would like to extend the API as follows:

- a flag to control if the match should be case sensitive or case insensitive
  and if we want to do a pattern match. I know this information from the pattern   
  the user typing in and if it speeds up things I can provide the information. 

- same for a package name. The user can provide a package name in the dialog
  which I could pass to the search query if the information can be used to
  filter things directly in the search engine.

- the public API allows contraining the elements in the dialog to a certain
  IJavaSearchScope. 

Regarding case sensitive and pattern match I can refilter the results. Same for
the package name since I get enough information in the call back. The scope
looks a little bit more compilcated. To decide if a type is enclosed in a scope
I need its path to call IJavaSearchScope#encloses(String resourcePath). Since I
don't have this information for non validated entries I have to postpone the
validation until I present them. This might for a scope that aonly matches a
handful of types cause performance problems since I have to validate them all as
soon as they get visible.

Kent, is there a way to honor the scope during the search ?

Comment 15 Kent Johnson

2005-03-16 10:52:40 EST

"we expose some of it functionality as public API"

what does it look like?

Comment 16 Dirk Baeumer

2005-03-16 11:21:35 EST

An example is JavaUI#createTypeDialog(Shell parent, IRunnableContext context,
IJavaSearchScope scope, int style, boolean multipleSelection, String filter).

Here a client can pass a search scope and a filter. The filter can contain
wildcards and package components (although this isn't speced this way we treat
it this way).

Comment 17 Kent Johnson

2005-03-18 17:11:52 EST

Martin, I released the call in SearchEngine to handle multiple qualifications 
& typeNames.

SearchEngine.searchAllTypeNames(
  final char[][] qualifications, 
  final char[][] typeNames,
  IJavaSearchScope scope, 
  final TypeNameRequestor nameRequestor,
  int waitingPolicy,
  IProgressMonitor progressMonitor)

Comment 18 Martin Aeschlimann

2005-03-23 12:51:02 EST

Just tried the new API. Good performance. A average search takes around 200ms
per file, resulting in a good speed for doing organize imports on the whole project.

I would suggest I switch to the new API!

Comment 19 Kent Johnson

2005-03-23 16:04:13 EST

Martin, are you using the project scope to limit the number of index files 
that need to be checked?

Comment 20 Martin Aeschlimann

2005-03-24 04:41:45 EST

Yes, a project scope is used.

Comment 21 Kent Johnson

2005-04-27 14:16:43 EDT

This PR is now done since the AllTypesCache has been removed.

We added the new API for organize imports. The Open Type dialog benefits from 
the caching of the category tables.

Comment 22 Olivier Thomann

2005-05-11 09:14:49 EDT

Verified in I20050510-0010