Re: [emf-dev] EMF Compare Name Similarity

Hi,

I'll start by a bit of history.

The original submission of EMF Compare was in fact made of two different products, one from Intalio and one from Obeo. Both had a comparison engine and a UI; Obeo's was more advanced on the engine side, whereas Intalio's was more advanced from a UI perspective. [1] refers to the Intalio product (it was actually published before EMF Compare was created). The presentation of this work at the Modeling Symposium in 2006 led to the creation of the project in 2007. At that time it was decided to keep Obeo's engine (which relied on the Dice coefficient) and Intalio's UI. The Levenshtein distance was used by Intalio's engine.

EMF Compare 1.3 is indeed similar to [4] and leverages the Dice coefficient quite a lot. The matching strategy is quite different in EMF Compare 2.x, but it still uses the Dice coefficient.


In a nutshell:

> I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for string similarity
That's right
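
For readers unfamiliar with the measure, here is a minimal sketch of a bigram-based Dice coefficient: twice the number of shared bigrams divided by the total number of bigrams in both strings. It is an illustration only, not the code from NameSimilarity.java; the class name DiceSketch, the method names, and the handling of strings shorter than two characters are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not EMF Compare's actual implementation.
final class DiceSketch {
    private DiceSketch() {
    }

    /** Counts every two-character substring (bigram) of s. */
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|), in [0, 1]. */
    static double similarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        // Assumption: a string with no bigram at all matches nothing but itself.
        if (a.length() < 2 || b.length() < 2) {
            return 0.0;
        }
        Map<String, Integer> left = bigrams(a);
        Map<String, Integer> right = bigrams(b);
        int shared = 0;
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            // Multiset intersection: a repeated bigram counts only as often
            // as it occurs in both strings.
            shared += Math.min(entry.getValue(), right.getOrDefault(entry.getKey(), 0));
        }
        int total = (a.length() - 1) + (b.length() - 1);
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        // "night" and "nacht" share one bigram ("ht") out of 4 + 4.
        System.out.println(similarity("night", "nacht")); // prints 0.25
    }
}
```

Unlike the Levenshtein distance, this measure does not depend on string length beyond the bigram counts, and it yields a ratio in [0, 1] directly.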

> II) EMF Compare 2.x uses the Longest Common Subsequence to determine changes in multi-references of EObjects
That's right, and it's used for multi-valued attributes too.
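
As an illustration of the idea, the sketch below computes a Longest Common Subsequence of two lists with the textbook dynamic-programming table. Elements outside the LCS are the candidates for additions, deletions, or moves. This is not the code from DiffUtil.java; the class name LcsSketch, the method signature, and the use of equals() for element equality are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not EMF Compare's actual implementation.
final class LcsSketch {
    private LcsSketch() {
    }

    /** Returns one longest common subsequence of left and right (non-null elements assumed). */
    static <T> List<T> lcs(List<T> left, List<T> right) {
        int n = left.size();
        int m = right.size();
        // len[i][j] = LCS length of left[i..n) and right[j..m).
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (left.get(i).equals(right.get(j))) {
                    len[i][j] = len[i + 1][j + 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start, collecting the matched elements.
        List<T> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (left.get(i).equals(right.get(j))) {
                result.add(left.get(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("a", "b", "c", "d");
        List<String> after = Arrays.asList("a", "c", "b", "d");
        // The LCS has three elements; the remaining one changed position.
        System.out.println(lcs(before, after));
    }
}
```

The table costs O(n * m) time and memory, which is acceptable for a sketch; a real implementation may need something more economical for long lists.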

> III) a) is wrong/outdated.

It refers to EMF Compare 1.3 (see the URL) and as such is neither wrong nor outdated, but there is no complete description of the 2.x algorithm on the wiki.


On 05/07/2013 14:53, Simon wrote:
Hi,

At the moment I am reverse engineering EMF Compare, and I have already read a lot of material. I think I found some inconsistencies in the material and want to ask whether I understand things correctly.

These are the statements in question:
a) According to [1], EMF Compare uses the Levenshtein distance for string similarity.
b) According to [3], EMF Compare 1.3 is similar to [4]. In [4], the Dice coefficient (although it is not named explicitly) is used for string similarity.


After a code review of [2] and [5], I came to the following conclusions:
I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for string similarity
II) EMF Compare 2.x uses the Longest Common Subsequence to determine changes in multi-references of EObjects
III) a) is wrong/outdated.

I would appreciate it if someone could confirm my conclusions.




References:

[1] http://eclipsesummit.org/summiteurope2006/presentations/ESE2006-EclipseModelingSymposium10_EMFCompareUtility.pdf

[2] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare.match/src/org/eclipse/emf/compare/match/internal/statistic/NameSimilarity.java?h=1.3

[3] http://wiki.eclipse.org/EMF_Compare/FAQ/1.3#What_kind_of_.22strategies.22_use_EMF_compare_.3F

[4] http://ase.cs.uni-due.de/olbib/p54-xing-241.pdf

[5] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare/src/org/eclipse/emf/compare/utils/DiffUtil.java?h=2.1



