[cross-project-issues-dev] Anonymisation of public data

Hello good people,

In the context of the Crossminer research project [1], we plan to publish a number of datasets for the public and the research community. These include public data from the Eclipse forge (i.e. data fetched from public data sources and APIs only), and we want to set up an anonymisation process that would:

* Efficiently and safely remove all personally identifiable data -- we don't want to help spammers or malicious harvesters, and
* Still provide valuable information and datasets for the research community -- e.g. the ability to identify identical IDs across sources without specifically knowing them.

The basic idea is simply to replace every identifier with an asymmetrically encrypted string, so that identical IDs always yield the same ciphertext. RSA is used for the encryption, and the private key is thrown away once the encoding is done, which makes it impossible (according to common encryption standards) to retrieve the original string.
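
To make the idea concrete, here is a minimal Python sketch of it. It is not the prototype's code: the function names (make_public_key, pseudonymise) are hypothetical, and it assumes unpadded ("textbook") RSA so that the same ID always encrypts to the same string. The private part of the key is never computed or stored, which stands in for "throwing the private key away".

```python
import secrets


def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_prime(bits):
    """Draw random odd candidates with the top bit set until one is prime."""
    while True:
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


def make_public_key(bits=1024, e=65537):
    """Return only the public key (n, e).

    The primes p and q are local variables and are never returned, and
    the private exponent is never computed (assumption of this sketch).
    """
    while True:
        p, q = random_prime(bits // 2), random_prime(bits // 2)
        # e is prime, so it is coprime to (p-1)(q-1) unless it divides a factor.
        if p != q and (p - 1) % e != 0 and (q - 1) % e != 0:
            return p * q, e


def pseudonymise(identifier, public_key):
    """Encrypt an identifier with unpadded RSA; same input, same output."""
    n, e = public_key
    m = int.from_bytes(identifier.encode("utf-8"), "big")
    if m >= n:
        raise ValueError("identifier too long for this key size")
    return format(pow(m, e, n), "x")
```

With a single key used for a whole run, the same address yields the same pseudonym in every data source, while different addresses yield different ones:

```python
key = make_public_key()
print(pseudonymise("alice@example.org", key) == pseudonymise("alice@example.org", key))
print(pseudonymise("alice@example.org", key) == pseudonymise("bob@example.org", key))
```

Note that unpadded RSA is not considered secure for general-purpose encryption; it is used in this sketch only because it is deterministic.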

A prototype has already been published [2, 3] and we would like to ask people to review it so as to make sure that our privacy-preserving mechanism is safe.

Any feedback, concern or contribution is warmly welcome.

[1] https://www.crossminer.org/
[2] https://github.com/borisbaldassari/data-anonymiser
[3] https://borisbaldassari.github.io/data-anonymiser/

Thanks in advance, have a wonderful week!

--
boris

