Bug 346831 - Provide a 'Find duplicate code snippets' action
Summary: Provide a 'Find duplicate code snippets' action
Status: ASSIGNED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: UI
Version: 3.7
Hardware: All All
Importance: P3 enhancement
Target Milestone: ---
Assignee: Deepak Azad CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-23 03:22 EDT by Deepak Azad CLA
Modified: 2012-10-09 09:24 EDT
CC List: 7 users

See Also:


Attachments

Description Deepak Azad CLA 2011-05-23 03:22:34 EDT
The Extract Method refactoring finds duplicate code snippets within the same type. We can reuse the same code (org.eclipse.jdt.internal.corext.refactoring.code.SnippetFinder) to provide a 'Find duplicate code snippets' action that finds duplicates in the same package, project, or workspace.

Invoking this action could bring up a dialog that offers the following options:
- Search in
  () Type
  () File
  () Package
  () Project
  () Workspace
- Matching rules
  [] Ignore variable names
  [] Ignore field names
  [] Ignore method names
  etc

The result can then be shown in the search view.

I quickly wrote some code that uses SnippetFinder to find duplicates within the same package. This mostly works, but we will have to tweak and generalize SnippetFinder. The action also has to create ASTs, so finding duplicates across the entire workspace (or even a whole project) will be slow, though I don't think a workspace-wide search is of much use anyway.
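
To make the 'Matching rules' idea concrete, here is a minimal sketch of what "ignore variable names" (and ignoring comments) could mean. It is not how SnippetFinder works and uses no JDT API: it compares token lists instead of ASTs, and the class name SnippetNormalizer, its methods, and the caller-supplied set of local names are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JDT: token-level stand-in for AST matching. */
final class SnippetNormalizer {
    // Comments first, then identifiers, numbers, or any single non-space character.
    // String literals are not handled specially; this is only a sketch.
    private static final Pattern TOKEN = Pattern.compile(
        "//[^\\n]*|/\\*.*?\\*/|[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\S", Pattern.DOTALL);

    /** Drops comments and renames each local consistently to #0, #1, ... */
    static List<String> normalize(String source, Set<String> localNames) {
        Map<String, String> renamed = new HashMap<>();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("//") || token.startsWith("/*")) {
                continue; // "ignore comments" rule
            }
            if (localNames.contains(token)) {
                String placeholder = renamed.get(token);
                if (placeholder == null) {
                    // '#' cannot start a Java token, so placeholders never clash
                    placeholder = "#" + renamed.size();
                    renamed.put(token, placeholder);
                }
                token = placeholder; // "ignore variable names" rule
            }
            tokens.add(token);
        }
        return tokens;
    }

    /** Two snippets are duplicates if their normalized token lists are equal. */
    static boolean sameShape(String a, Set<String> localsA, String b, Set<String> localsB) {
        return normalize(a, localsA).equals(normalize(b, localsB));
    }
}
```

Because each name is renamed in order of first occurrence, "a = a + b" and "x = y + x" are not reported as duplicates, while snippets that differ only in the spelling of a local variable are. Method and field names are left untouched, so snippets that call different methods never match.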
Comment 1 Dani Megert CLA 2011-05-23 03:28:49 EDT
Nice idea.

Did you also think about allowing the search to be parametrized, i.e. which parts must be exactly the same and which can vary?

Also, we might want to investigate whether JDT Core could offer this better or faster.
Comment 2 Deepak Azad CLA 2011-05-23 07:13:12 EDT
(In reply to comment #1)
> Did you also think of allowing to parametrize the search i.e. which parts must
> be exactly the same and which can vary?

I did mention this in comment 0:
> - Matching rules
>   [] Ignore variable names
>   [] Ignore field names
>   [] Ignore method names
>   etc

SnippetFinder already parametrizes variable names; I am not sure how much more we want to parametrize. For example, overloaded method calls such as foo(int) and foo(String) could be treated as matching. But I have not given this much thought yet.
Comment 3 Dani Megert CLA 2011-05-23 07:15:29 EDT
Ignoring comments is probably also a good one.
Comment 4 Marcel Bruch CLA 2011-05-24 05:55:17 EDT
Just a few questions that come to mind:

What granularity of duplication do you have in mind?

Do you plan to search only for exact matches of the extracted AST?

For instance, when searching for 

Button b = new Button(..)
b.setText();

would it also match:

Button b= new Button(..)
Griddata gd = ..
b.setlayoutData(..)
b.setText();


Is the order of method calls relevant?

i.e., would:
 
Button b= new Button(..)
b.setlayoutData(..)
b.setText();

also match:

Button b= new Button(..)
b.setText();
b.setlayoutData(..)
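
The questions above distinguish three possible matching policies: a contiguous exact run, an in-order match that tolerates intervening statements, and an order-insensitive match. The sketch below illustrates the difference on plain lists of statement strings. It is only an illustration under that simplification; the class StatementMatcher and its method names are hypothetical and say nothing about how JDT compares ASTs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration: statements are modelled as plain strings. */
final class StatementMatcher {

    /** The pattern statements appear as one contiguous run, in order. */
    static boolean contiguous(List<String> pattern, List<String> candidate) {
        return Collections.indexOfSubList(candidate, pattern) >= 0;
    }

    /** The pattern statements appear in order; other statements may intervene. */
    static boolean inOrderWithGaps(List<String> pattern, List<String> candidate) {
        int next = 0; // index of the next pattern statement still to be found
        for (String statement : candidate) {
            if (next < pattern.size() && pattern.get(next).equals(statement)) {
                next++;
            }
        }
        return next == pattern.size();
    }

    /** Every pattern statement appears somewhere; order is ignored. */
    static boolean anyOrder(List<String> pattern, List<String> candidate) {
        List<String> remaining = new ArrayList<>(candidate);
        for (String statement : pattern) {
            if (!remaining.remove(statement)) {
                return false; // statement missing (or already used up)
            }
        }
        return true;
    }
}
```

With the first example (an extra GridData line and a setLayoutData call between the constructor and setText), only inOrderWithGaps and anyOrder report a match. With the second example (setText and setLayoutData swapped), only anyOrder does.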



I think support for finding similar code snippets is a great idea. Are you considering a plug-in mechanism for this? Since there is a large body of research on similar topics, I could imagine that an extension point that lets other tools plug in their own detectors would be beneficial for Eclipse. One could also define a benchmark for this as a research challenge. This may sound a bit naive. Do you have any experience with inviting researchers to contribute this kind of work?
Comment 5 Deepak Azad CLA 2011-05-24 08:06:05 EDT
(In reply to comment #4)
> For instance, when searching for 
> 
> Button b = new Button(..)
> b.setText();
> 
> would it also match:
> 
> Button b= new Button(..)
> Griddata gd = ..
> b.setlayoutData(..)
> b.setText();
> 
> 
> Is the order of method calls relevant?
> 
> i.e., would:
> 
> Button b= new Button(..)
> b.setlayoutData(..)
> b.setText();
> 
> also match:
> 
> Button b= new Button(..)
> b.setText();
> b.setlayoutData(..)

Interesting ideas! The logic in extract method refactoring does not handle these cases though. 
 
> Since there is a large
> body of work in similar topics in research,
If you are aware of some research papers which talk about finding code duplicates, can you share them here?
Comment 6 Markus Keller CLA 2011-05-24 09:33:52 EDT
We shouldn't push too many features into this action/refactoring. We should stay with finding duplicates where we can be reasonably sure that they are really the same (similar to what Extract Method does today, but add an option to widen the scope for duplicate search).

Allowing intermediate method calls is already quite advanced and would also touch topics like the "fuzzy" AST search we've already experimented with (but gave up because the fuzziness is hard to define and a tool that often doesn't do what you want is not that worthwhile).

Ignoring comments and local variable names is fine, but ignoring field and method names is definitely not semantics-preserving.
Comment 7 Marcel Bruch CLA 2011-05-25 03:43:38 EDT
(In reply to comment #5)
> > Since there is a large
> > body of work in similar topics in research,
> If you are aware of some research papers which talk about finding code
> duplicates, can you share them here?

Just a few papers we are aware of here in our group:

FSE 2010
* Instant code clone search

ICSE 2009
* Complete and accurate clone detection in graph-based models
* Do code clones matter?
* CloneDetective - A workbench for clone detection research

ICSE 2008
* Scalable detection of semantic clones
* Clonetracker: tool support for code clone management

ASE 2009
* Clone-Aware Configuration Management

ASE 2008
* Cleman: Comprehensive Clone Group Evolution Management

http://students.cis.uab.edu/tairasr/clones/literature/ lists many things including recent papers.


I agree with Markus that JDT should "offer only what works". But given this large and recent body of research, I still think a great option would be (i) a simple extension point for this kind of work, with (ii) a good and stable JDT baseline implementation, while (iii) allowing others to contribute a replacement.
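
As a rough sketch of points (i) to (iii), the shape of such a contract might look like the following. This is plain Java, not an actual Eclipse extension point (which would be declared in plugin.xml), and every name here (DuplicateFinder, ExactTextFinder, DuplicateFinderRegistry) is hypothetical. The baseline is a trivial exact-text search standing in for a real JDT implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical contract that a contributed detector would implement. */
interface DuplicateFinder {
    /** Returns the start offsets in 'scope' at which 'snippet' is duplicated. */
    List<Integer> find(String snippet, String scope);
}

/** Stand-in baseline: reports non-overlapping exact text matches only. */
final class ExactTextFinder implements DuplicateFinder {
    @Override
    public List<Integer> find(String snippet, String scope) {
        List<Integer> offsets = new ArrayList<>();
        if (snippet.isEmpty()) {
            return offsets; // nothing meaningful to search for
        }
        int from = scope.indexOf(snippet);
        while (from >= 0) {
            offsets.add(from);
            from = scope.indexOf(snippet, from + snippet.length());
        }
        return offsets;
    }
}

/** Picks a contributed finder if one is registered, else the baseline. */
final class DuplicateFinderRegistry {
    private DuplicateFinder contributed; // null until a replacement is registered

    void contribute(DuplicateFinder finder) {
        contributed = finder;
    }

    DuplicateFinder active() {
        return contributed != null ? contributed : new ExactTextFinder();
    }
}
```

The point of the sketch is only the division of responsibilities: the action talks to DuplicateFinder, the baseline always exists, and a research tool could replace it without the action changing.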