Re: [jgit-dev] RevWalk next is slow for git repos that have a long commit history. (2)

See also:

https://github.com/peff/git/blob/jk/blame-tree/blame-tree.c
https://git.wiki.kernel.org/index.php/ExampleScripts#Setting_the_timestamps_of_the_files_to_the_commit_timestamp_of_the_commit_which_last_touched_them

which describe the same concept in the context of C git.

Jonathan Nieder wrote:
I would suggest going the other way around: keep a HashSet remembering which files you have already visited and then walk:

For each commit you see, use a TreeWalk to compare that commit to its parent. For each changed file that has not already been visited, update its copyright header.

Configure the RevWalk to stop when you've hit a commit you've done this search for already.
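The single-pass idea above can be sketched without any JGit calls. This is only an illustration under stated assumptions: the `Commit` class, `lastModifiedYears`, and the `stopAtId` parameter are hypothetical names, and the list of commits stands in for what a RevWalk (newest first) plus a TreeWalk diff against the parent would produce. Merge commits are ignored here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LastTouchedSketch {

    /** Hypothetical stand-in for one commit: its id, its year, and the paths that differ from its parent. */
    static final class Commit {
        final String id;
        final int year;
        final Set<String> changedPaths;

        Commit(String id, int year, String... changedPaths) {
            this.id = id;
            this.year = year;
            this.changedPaths = new HashSet<>(Arrays.asList(changedPaths));
        }
    }

    /**
     * Walks the history once, newest commit first, and records for each wanted
     * path the year of the first (most recent) commit that changed it.
     * The walk stops early when every wanted path is resolved, or when it
     * reaches stopAtId (a commit handled by an earlier run; may be null).
     */
    static Map<String, Integer> lastModifiedYears(
            List<Commit> newestFirst, Set<String> wanted, String stopAtId) {
        Map<String, Integer> result = new HashMap<>();
        Set<String> visited = new HashSet<>(); // files already seen, as in the suggestion above
        for (Commit c : newestFirst) {
            if (c.id.equals(stopAtId)) {
                break; // this commit and everything older was processed before
            }
            for (String path : c.changedPaths) {
                // Set.add returns false for a path that was already visited
                if (wanted.contains(path) && visited.add(path)) {
                    result.put(path, c.year);
                }
            }
            if (visited.size() == wanted.size()) {
                break; // every wanted file has its answer
            }
        }
        return result;
    }
}
```

The point of the design is that the history is traversed once for all files, rather than once per file, so the cost grows with the number of commits plus the number of changed paths instead of commits times files.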

On Wed, 3 June 2015 at 13:15, Leo Ufimtsev <lufimtse@xxxxxxxxxx> wrote:
Hello jgit developers,

I'm working on improving the performance of Eclipse's Releng Copyright fix
tool:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=468850

We ran into the problem that RevWalk.next() is very slow for repositories
that have a long commit history.
For example, eclipse.jdt.ui has 26,000+ commits and 15,000+ files.

I was wondering if this is a known issue and if there is a way to improve
performance or work around it?


------ To be specific:  -------

The tool traverses each file in a project. For each file:
 - it finds its repository,
 - starting from git's HEAD commit, it calls RevWalk.next() to walk backwards
   through history until it finds the commit that last modified the file,
 - it extracts the year of that commit,
 - it then updates the file's copyright header (2001-2011) -> (2001-2014).
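As an illustration of the last step, here is a minimal sketch of the header update. It assumes the header contains a year range of the form shown above (four-digit start year, hyphen, four-digit end year); the class and method names are hypothetical and this is not the Releng tool's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CopyrightYearSketch {

    // Assumed header shape: start year, hyphen, end year, e.g. "2001-2011".
    private static final Pattern YEAR_RANGE =
            Pattern.compile("\\b(\\d{4})-(\\d{4})\\b");

    /**
     * Returns headerLine with the end year of its first year range raised to
     * lastModifiedYear. Lines without a range, or whose end year is already
     * at least lastModifiedYear, are returned unchanged.
     */
    static String updateEndYear(String headerLine, int lastModifiedYear) {
        Matcher m = YEAR_RANGE.matcher(headerLine);
        if (!m.find()) {
            return headerLine; // no year range found
        }
        int currentEnd = Integer.parseInt(m.group(2));
        if (lastModifiedYear <= currentEnd) {
            return headerLine; // never move the end year backwards
        }
        // Splice the new year in place of the old end year only.
        return headerLine.substring(0, m.start(2))
                + lastModifiedYear
                + headerLine.substring(m.end(2));
    }
}
```

For instance, `updateEndYear("Copyright (2001-2011)", 2014)` yields `Copyright (2001-2014)`.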

The problem is that RevWalk.next() takes 2-3 seconds per file for
repositories that have very long commit histories (e.g. eclipse.jdt.ui has
26,814 commits), and with 15,000+ files in a project this operation can take
many hours to complete.

The call chain where the time goes:
 RevWalk.next()
  -> StartGenerator.next()
    -> FIFORevQueue constructor
     -> 56: BlockRevQueue constructor (Generator s)
        -- the 'for loop' can loop 10k+ times per file.

I found that the native git-log command is also very slow.
For example, calling git log once per file on 15,000 files takes 13 minutes
for eclipse.jdt.ui:
  time find . -name "*.java" -exec git log -1 {} \; > /dev/null

In contrast, cat-ing every file takes only 6 seconds:
  find . -exec cat {} \; > /dev/null 2>&1


Given this git-log limitation, is there some way to, e.g., cache the
repo and the commit history, or to find the last-modified date of a file
faster than by walking the git commit history?

Any advice/tips?

Thank you.

--
Leo Ufimtsev | Intern Software Engineer @ Eclipse Team
