Re: [cross-project-issues-dev] Anonymisation of public data

On 27/04/2018 12:57, Gunnar Wagenknecht wrote:
> Hi Boris,
Hi Gunnar,

> I was one of the people asking off-list because I have a concern with encryption as a technology for anonymizing data. It immediately raises a red flag for me because it allows the data to be de-anonymized. Thus, I would like to see data masking techniques such as hashing used instead of encryption. To be clearer, I find it suspicious that reversible anonymization must be used in the first place.

This anonymisation mechanism is not only meant for the Eclipse datasets. It's meant to be used by other teams and projects too, hence the requirement/feature. In the specific context of the Eclipse datasets, we won't even *save* the key, so it's rather safe, especially considering we're talking about public data.

And I'm not certain hashing is better than encrypting (assuming the key is really thrown away) because of rainbow tables and similar techniques. And since we're talking about public data, cracking the encryption (or hashing) is a *lot* harder than simply reading the public sources.
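
To make the rainbow-table point concrete, here is a minimal Python sketch. It is not the Crossminer anonymisation tool: HMAC-SHA256 stands in for "a keyed scheme whose key is thrown away" (the actual mechanism discussed here uses encryption), and the address and function names are hypothetical. It only shows why an unsalted hash of a guessable value can be looked up, while a keyed pseudonym cannot be recomputed without the key.

```python
import hashlib
import hmac
import secrets


def plain_hash(value: str) -> str:
    """Unsalted SHA-256: anyone can recompute it from a guessed input."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value: str, key: bytes) -> str:
    """HMAC-SHA256: reproducible only by someone who holds `key`."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical address, used only for illustration.
email = "jane.doe@example.org"
candidates = [email, "other@example.org"]

# An attacker with a list of candidate addresses can precompute plain hashes
# and map a published hash back to the address.
lookup = {plain_hash(candidate): candidate for candidate in candidates}
recovered = lookup.get(plain_hash(email))

# With a random key that is never stored, the same precomputed table is useless.
key = secrets.token_bytes(32)
pseudonym = keyed_pseudonym(email, key)
found = lookup.get(pseudonym)
```

The pseudonym stays consistent within one run (the same address always maps to the same value under the same key), which is what a dataset needs, but it cannot be regenerated once the key is discarded.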


> Can you also be more specific about what public data and which API endpoints you are going to use?
> I assume it's anything that is public in Git already, which makes this discussion obsolete as everything is already public. But I want to confirm that none of the API endpoints require authentication to get data you wouldn't get without authentication.

*All* the data we retrieve is public, and I confirm no auth is required to access it. There will be Git, Bugzilla, Forums, CI, SonarQube, mailing lists -- all of which can be accessed by anyone publicly.

This does not make the discussion obsolete, however. Even if the information is public, I do NOT want to ease the work of spammers or malicious people (i.e. it'd be easier for them to read CSVs than git log). Hence the anonymisation, even if it's on public data.


I'd be happy to have reviewers for the datasets, by the way. So if anybody is willing to double check results or be part of the process, please let me know.


One last thing: Crossminer is an EU-funded project, and they pay attention to privacy (especially with the upcoming GDPR), so the context is rather safe, and you actually do not even need to trust me personally. :-)


Once again, your concern and associated feedback are welcome, and I'm happy to discuss that.

Cheers!


--
boris




> Best,
> Gunnar


