Bug 468850 - [relengtool] Copyright tool runs slowly for git repositories
Summary: [relengtool] Copyright tool runs slowly for git repositories
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Releng (show other bugs)
Version: 4.5   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: 4.12 M3   Edit
Assignee: Paul Pazderski CLA
QA Contact:
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks: 278683 468502 546474
  Show dependency tree
 
Reported: 2015-05-29 14:25 EDT by Leo Ufimtsev CLA
Modified: 2019-04-16 10:45 EDT (History)
10 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Leo Ufimtsev CLA 2015-05-29 14:25:54 EDT
This is related to:
Bug 216797 - [relengtool] Fix copyrights tool run very slowly 

In the bug above, Releng's copyright tool had a memory leak which caused run-time to be exponential. That was fixed. In blank project with 100,000's of files the Copyright tool works fast.

However, Dani Megert discovered another performance issue:

When working with projects that are inside a  repository (git) 
*with a very long history*, 
(e.g org.eclipse.jdt.ui), performance is very slow. (1-2 seconds per file). 

Test case: Run Copyright tool on org.eclipse.jdt.ui 
Expected:  0-5 minutes
Actual:    73 minutes on an SSD disk.

Note:
- If in the Copyright Tool settings you enable "Always use default revision year instead of repository lookup", then the task is executed in ~3 seconds. This it appears that the slowdown only occurs if we need to refer to the git repository.
Comment 1 Leo Ufimtsev CLA 2015-05-29 15:26:45 EDT
The performance issue is caused by:

org.eclipse.releng.tools.git.GitCopyrightAdapter.getLastModifiedYear(IFile, IProgressMonitor)
... 
~67:   final RevCommit commit = walk.next();

For each file that is processed, there is a 700ms delay in this function.

org.eclipse.jgit.revwalk.RevWalk.next() is a method defined in jgit.
Comment 2 David Williams CLA 2015-06-03 14:48:41 EDT
(In reply to Leo Ufimtsev from comment #1)
> The performance issue is caused by:
> 
> org.eclipse.releng.tools.git.GitCopyrightAdapter.getLastModifiedYear(IFile,
> IProgressMonitor)
> ... 
> ~67:   final RevCommit commit = walk.next();
> 
> For each file that is processed, there is a 700ms delay in this function.
> 
> org.eclipse.jgit.revwalk.RevWalk.next() is a method defined in jgit.

Does this mean you should open a bug in JGit? 
(If not a fix, perhaps they'd at least suggest an alternative?)
Comment 3 Leo Ufimtsev CLA 2015-06-03 17:34:50 EDT
(In reply to David Williams from comment #2)
> (In reply to Leo Ufimtsev from comment #1)
> > The performance issue is caused by:
> > 
> > org.eclipse.releng.tools.git.GitCopyrightAdapter.getLastModifiedYear(IFile,
> > IProgressMonitor)
> > ... 
> > ~67:   final RevCommit commit = walk.next();
> > 
> > For each file that is processed, there is a 700ms delay in this function.
> > 
> > org.eclipse.jgit.revwalk.RevWalk.next() is a method defined in jgit.
> 
> Does this mean you should open a bug in JGit? 
> (If not a fix, perhaps they'd at least suggest an alternative?)

Hello,

Sorry for ambiguous comment. 

I investigated, 'git log' is pretty slow in general because it has to traverse the whole commit history to find a commit where the file was modified.

e.g
time find . -exec git log -1 {} \; > /dev/null 2>&1 
takes 13 minutes to run on eclipse.jdt.ui 

Git log behaviour described here:   http://stackoverflow.com/questions/7714434/how-does-git-file-log-internally-works

In the mean time I've emailed jgit folks to ask them if there is a known way around this limitation. I'm going to check back next week if they haven't replied to me by then.

I'll post comments on progress.
Comment 4 Leo Ufimtsev CLA 2015-06-04 10:24:57 EDT
I pinged jgit developers about it and got replies:
(The email has a comprehensive overview of the issue, see also the followup emails):
https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg02899.html

It helps to understand how git works internally:
http://eagain.net/articles/git-for-computer-scientists/

In essence, the issue is that for every file we traverse almost the whole git commit history (assuming many files were not modified in a long time). 
This leads to: 
 F = number of files 
 C = commit history length
O(F * C) = O(n^2)

There is no way to get file modification date without traversing the git-commit history till you find the commit that included the file. This is git by design.

One solution would be to first traverse the git commit history and make a note of when a file was last modified in a hash-table. (e.g during construction of the git adapter. This would be ~O(n) (with some sane assumptions). 
Then during file traversal reference the hash-table instead of re-traversing the git history each time. (walk.next()  -> reference hash table). 
This would also be O(n) (with some sane assumption about hash-table collisions). 
So we'd get O(n) + O(n) = O(n) run time.

This would take a bit of time to implement.

Unfortunately at present there are some pressing issues in SWT that I have to work on. I thus re-assign the task back to inbox.
Comment 5 Dani Megert CLA 2015-06-04 10:55:33 EDT
(In reply to Leo Ufimtsev from comment #4)
> Unfortunately at present there are some pressing issues in SWT that I have
> to work on. I thus re-assign the task back to inbox.

Any idea when you can come back to this?
Comment 6 Leo Ufimtsev CLA 2015-06-04 11:07:08 EDT
(In reply to Dani Megert from comment #5)
> (In reply to Leo Ufimtsev from comment #4)
> > Unfortunately at present there are some pressing issues in SWT that I have
> > to work on. I thus re-assign the task back to inbox.
> 
> Any idea when you can come back to this?

When SWT runs smoother on Gtk3 :-), which might be a loooooonnnggg time.
Comment 7 Leo Ufimtsev CLA 2016-06-29 17:28:05 EDT
From what I gather there is no longer a requirement to update copyright headers.
As such, the copyright tool may not get much attention any time soon.

Perhaps it might make sense to close this bug to reduce bugzilla clutter. Thoughts?
Comment 8 Lars Vogel CLA 2016-06-29 17:31:15 EDT
(In reply to Leo Ufimtsev from comment #7)
> Perhaps it might make sense to close this bug to reduce bugzilla clutter.
> Thoughts?

+1
Comment 9 David Williams CLA 2016-06-29 20:32:47 EDT
-1 to closing. 

Some people will want to use it, so it is still a valid bug. 

I agree though, given the lack of emphasis on "copyright statements everywhere", 
that is is more a 'normal' severity, rather than 'major'.
Comment 10 Dani Megert CLA 2016-08-04 12:45:37 EDT
(In reply to Leo Ufimtsev from comment #7)
> From what I gather there is no longer a requirement to update copyright
> headers.
> As such, the copyright tool may not get much attention any time soon.
> 
> Perhaps it might make sense to close this bug to reduce bugzilla clutter.
> Thoughts?

-1.
Comment 11 Lars Vogel CLA 2018-09-11 17:06:18 EDT
no plans to fix the performance issue atm. Please reopen if you plan to work on this.
Comment 12 Paul Pazderski CLA 2019-04-15 18:16:22 EDT
I cannot reopen this bug myself but I actually worked on it. (before I found this report because I was also annoyed by the slowness)

I run the current 'Fix Copyright' implementation on all files of eclipse.jdt.debug, eclipse.pde.ui, eclipse.platform.debug and eclipse.platform.ui and it took more than an hour to finish (79 minutes). After my change the same projects could be processed in only 2 minutes with precalculation enabled. (and of course both variants produced the same result)
Comment 13 Eclipse Genie CLA 2019-04-15 18:17:34 EDT
New Gerrit change created: https://git.eclipse.org/r/140640
Comment 14 Dani Megert CLA 2019-04-16 03:59:10 EDT
Thanks. Let's see whether Leo or someone in cc has time to look at the change.

Please ping again here if nothing happens for a while.
Comment 15 Lars Vogel CLA 2019-04-16 04:09:42 EDT
Paul, to use the copyright tool again in for Eclipse platform again we need to update it to support EPL 2.0. 

This is Bug 538369, could you have a look now that you already familiar with the code?
Comment 16 Alexander Kurtakov CLA 2019-04-16 04:54:47 EDT
(In reply to Dani Megert from comment #14)
> Thanks. Let's see whether Leo or someone in cc has time to look at the
> change.
> 
> Please ping again here if nothing happens for a while.

Leo, no longer works on Eclipse so he won't look at it.
Comment 17 Thomas Wolf CLA 2019-04-16 05:52:13 EDT
I don't have the time to deal with this right now. From a quick look at https://git.eclipse.org/r/#/c/140640/1/bundles/org.eclipse.releng.tools/src/org/eclipse/releng/tools/git/GitCopyrightAdapter.java it seems to me that all you're trying to do is get the last commit that touched a file.

If so, switch off parent rewriting. That alone will give a huge performance boost. (With parent rewriting on, jgit will indeed read and buffer the full history. If you only want to get the last commit and don't care about the earlier commits, you can safely switch that off, and then jgit will only traverse the history until the last commit that touched the file.)

See https://www.eclipse.org/lists/egit-dev/msg04597.html .
Comment 18 Paul Pazderski CLA 2019-04-16 06:48:50 EDT
(In reply to Thomas Wolf from comment #17)
> If so, switch off parent rewriting. That alone will give a huge performance
> boost.

Unfortunately for me this is true. This brings the same performance improvement as my change and therefore with a minimal code change brings a huge improvement compared to the current situation.

I compared results from disabled parent rewriting and my change by 'copyright fixing' eclipse.platform.ui and got the same results from both. The current implementation is still running.

I abandon my original change but may provide some improvements to the unit test later.
Comment 19 Eclipse Genie CLA 2019-04-16 06:49:45 EDT
New Gerrit change created: https://git.eclipse.org/r/140664
Comment 21 Lars Vogel CLA 2019-04-16 06:58:47 EDT
Copyright tool runs now super fast, thanks Paul. Thanks also to Thomas for the tip.

Paul, please use a new bug for enhancements of the tests.
Comment 22 Leo Ufimtsev CLA 2019-04-16 10:44:38 EDT
(In reply to Alexander Kurtakov from comment #16)
> (In reply to Dani Megert from comment #14)
> > Thanks. Let's see whether Leo or someone in cc has time to look at the
> > change.
> > 
> > Please ping again here if nothing happens for a while.
> 
> Leo, no longer works on Eclipse so he won't look at it.

Blasphemy, don't listen to Alex.

:-D. J.K.

Yea, I moved into Middleware/Java consulting. I still use Eclipse for work when I'm at client, although my current manager doesn't give me time to review eclipse code at the moment. But I kinda keep taps on bugs here and there.

I think the copyright tool was actually one of the first set of bugs I've worked on when I started as Software Engineer. I recall one of the performance issues was caused by a memory leak when a file wasn't closed. I ended up writing an article about finding memory leaks:
https://developers.redhat.com/blog/2014/08/14/find-fix-memory-leaks-java-application/
Comment 23 Leo Ufimtsev CLA 2019-04-16 10:45:44 EDT
(In reply to Lars Vogel from comment #21)
> Copyright tool runs now super fast, thanks Paul. Thanks also to Thomas for
> the tip.
> 
> Paul, please use a new bug for enhancements of the tests.

Thanks Paul. Very nice fix.