Bug 130390 - CamelCase algorithm cleanup and improvement
Summary: CamelCase algorithm cleanup and improvement
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 3.2   Edit
Hardware: PC All
: P3 enhancement (vote)
Target Milestone: 3.2 RC1   Edit
Assignee: Frederic Fusier CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-03 15:51 EST by Zorzella Mising name CLA
Modified: 2006-04-19 02:38 EDT (History)
2 users (show)

See Also:


Attachments
Patch to re-implement CamelCase (7.44 KB, patch)
2006-03-03 15:52 EST, Zorzella Mising name CLA
no flags Details | Diff
Tuned patch to fit last changes done in CharOperation (20.42 KB, patch)
2006-04-10 08:06 EDT, Frederic Fusier CLA
no flags Details | Diff
Final patch including fixes for some side effects + duplicate algortihm in SearchPatter (31.96 KB, patch)
2006-04-10 13:40 EDT, Frederic Fusier CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zorzella Mising name CLA 2006-03-03 15:51:34 EST
I've been plagued by the fact that Eclipse's CamelCase is not (IMHO) as powerful as IntelliJ's. I created a patch to CharOperation.java to both simplify and make more powerful the algo. I've also added a test that is very indicative of what I think should work...

I realize there is a split between the algo used for auto-completion (this) and another (which I have not found yet) for "Open Type". If the algo I attach to this bug is deemed good, I'd urge you to help me out to make it the one and only used by anything that requires CamelCase.

Z
Comment 1 Zorzella Mising name CLA 2006-03-03 15:52:37 EST
Created attachment 35703 [details]
Patch to re-implement CamelCase
Comment 2 Philipe Mulet CLA 2006-03-03 17:08:53 EST
We will consider it, thanks for the input. The test is very helpful too, since it sets the expectation.

The current implementation mostly meets the basics, if community agrees we can make it more powerful; but we need to keep the rules simple.

Just to clarify things, can you detail the rules you implemented ?
Comment 3 Zorzella Mising name CLA 2006-03-03 18:32:57 EST
CamelCase as implemented is not CamelCase at all, like IntelliJ's... It only accepts capital letters in the beginning, and lowercase in the end. So, while HMap matches HashMap, HaM does not. Easiest way is for me to list all tests that I wrote that will fail under the current algo. 

"." for the ones that will pass,
F is for the ones that will fail, 
? I'm not quite sure 

.            {"TZ","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
F            {"TiZ","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
F            {"TiZon","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TZon","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TZone","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TimeZone","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TimeZ","TimeZ"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TZ","TimeZ"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"T","TimeZ"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"T","TimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
.            {"TZ","TZ"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aT","aTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTi","aTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTiZ","aTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTZ","aTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aT","artTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTi","artTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTiZ","artTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
?            {"aTZ","artTimeZone"},  //$NON-NLS-1$//$NON-NLS-2$
Comment 4 Philipe Mulet CLA 2006-03-04 06:38:07 EST
Well, strictly speaking there is no absolute definition of CamelCase matching. So we have flexibility here. Normally, matching on uppercase letters is the base for CamelCase matching; which is what we implemented.
Comment 5 Zorzella Mising name CLA 2006-03-06 11:24:33 EST
Sorry: I did not mean to imply that there is an absolute right way of doing it, only that:

1) it could be much more useful

2) it is not any harder to do so

3) there is no downside (that I can see)

4) the patch is included

The remark about "CamelCase" is that it seems to me the name comes from the fact that there are SevEraL humps in the query string, rather then the SIngle hump as currently implemented (DromedaryCase?) :^> 

Comment 6 Philipe Mulet CLA 2006-03-06 13:19:42 EST
Ok, we are in agreement.

Dirk - as emeritus CamelCase expert, do you have an opinion on this evolution ?

On the same topic, some also wanted to have CamelCase bettern perform with digits...
Comment 7 Dirk Baeumer CLA 2006-03-13 05:10:04 EST
Two things to remark:

- if we change the algo then we have to adapt SearchPattern.validateMatchRule 
  as well.

- if we add the additional support then we have to think about the impact on
  sorting as well. Consider the following example:

  o user types TiZone and there is a type TiZone
  o since TiZone also matches TimeZone the sorting will result in:
    ...
    TimeZone
    TiZone
    ...
 
    resulting in the fact that an exact match doesn't appear first. If we sort
    TiZone first then the list will not be strictly alphabetical anymore.

In general I like the idea of making CC match more powerful.
    
Comment 8 Zorzella Mising name CLA 2006-03-13 15:12:14 EST
Hmmmm. Does alphabetical sorting do all Uppercases always before any lowercase? I.e., is "ZZ" less than "aa"? If not, the currenct impl is already not guaranteed to prefer exact matches. If you type 'TZ', and there's a class named TZ, it will show up after TimeZone (again, unless uppercases are all first).

As another point, order is tweaked with time to give preference to what the user likes best.

But finally, I agree that a perfect match should always be preferred -- maybe an explicit test for exact matches should always be performed.
Comment 9 Philipe Mulet CLA 2006-04-04 09:05:02 EDT
Frederic - pls take care of it. We want to handle "NuPoEx"
Comment 10 Frederic Fusier CLA 2006-04-10 07:56:42 EDT
I think Dirk concern about sorting is not relevant as OpenType dialog typically sort in alphabetical order and "TiZone" is before "TimeZone" ('Z' is lower than 'm' and all lowercase characters)
=> If user type "TiZone", then he'll get most relevant match on top of list.

Moreover, we currently have the same behavior (ie. the patch does not change anything). If user type "TZ", he'll get "TZ" class before "TimeZone" one...
Comment 11 Frederic Fusier CLA 2006-04-10 07:58:35 EDT
"Zorzella", I would like to have your real first and last names to add you as contributor in copyright comment list, thanks
Comment 12 Frederic Fusier CLA 2006-04-10 08:06:43 EDT
Created attachment 38134 [details]
Tuned patch to fit last changes done in CharOperation

Note that I modified tests for uppercase or lowercase character as it must be splitted in two separated tests to avoid running into Character tests methods systematically.
I also put your test in UtilTests class which was the initial class for this CharOperation tests...

If anything hurts you, please shout...
Comment 13 Frederic Fusier CLA 2006-04-10 09:17:35 EDT
Dirk,
As you can see in patch, SearchPattern.camelCaseMatch(*) methods have been rewritten to call CharOperation method to avoid duplicated algorithms.
If you think this can have a big impact on your performance results then let me know asap and I'll revert this...
Comment 14 Dirk Baeumer CLA 2006-04-10 11:37:26 EDT
Frederic,

the reason why we added the String based methods in the first place was to avoid char creation for every string we compare.

I currently use SearchPattern.camelCaseMatch for the history and the cached result. For example if you type A and the dialog received all matches before you type the next character then I search in memory in the cached result. In big workspaces this can be thousands of types. So the performance impact will be the additional garbage collects.
Comment 15 Frederic Fusier CLA 2006-04-10 11:56:33 EDT
(In reply to comment #14)
> Frederic,
> 
> the reason why we added the String based methods in the first place was to
> avoid char creation for every string we compare.
> 
> I currently use SearchPattern.camelCaseMatch for the history and the cached
> result. For example if you type A and the dialog received all matches before
> you type the next character then I search in memory in the cached result. In
> big workspaces this can be thousands of types. So the performance impact will
> be the additional garbage collects.
> 
OK, so I will put back to SearchPattern duplicated algorithm using String and charAt(int) method...
Comment 16 Zorzella Mising name CLA 2006-04-10 12:33:41 EDT
Frédéric,

my full name is Luiz-Otávio Zorzella. And, yes, I do go by my last name, Zorzella. 

Thanks,

Z
Comment 17 Frederic Fusier CLA 2006-04-10 13:40:57 EDT
Created attachment 38178 [details]
Final patch including fixes for some side effects + duplicate algortihm in SearchPatter

Luiz, please let me know if this patch is OK for you.
I'm currently running tests on it and so, I should be able to release it tomorrow early morning if you agree...
Thanks
Comment 18 Zorzella Mising name CLA 2006-04-10 16:37:50 EDT
Nice... Could you list my email as "zorzella at gmail dot com" or is that against the norm? If so, leave as it is.

Thanks for seeing this through. I can barelly wait to have power-CamelCase!!!

Zorzella
Comment 19 Frederic Fusier CLA 2006-04-11 02:30:57 EDT
Patch released in HEAD (I've changed your email as requested in copyright comment).
This would be available in next nightly build (e.g. N20060412-0010) tomorrow morning.
Thanks again for your contribution :-)
Comment 20 Frederic Fusier CLA 2006-04-11 02:32:27 EDT
You should read "This _will_ be available..." in previous comment...
Comment 21 Jerome Lanneluc CLA 2006-04-13 11:01:50 EDT
Verified for 3.2 RC1 using build I20060413-0010 
Comment 22 Frederic Fusier CLA 2006-04-19 02:38:10 EDT
[contributed patch applied]