532300 – Performance issue with add call

Status:	REOPENED

Alias:	None

Product:	JGit
Classification:	Technology
Component:	JGit
Version:	4.10
Hardware:	PC Windows All

Importance:	P3 major
Target Milestone:	5.0
Assignee:	Project Inbox
QA Contact:

URL:
Whiteboard:
Keywords:	api

Depends on:	388582
Blocks:

Reported:	2018-03-12 04:11 EDT by Vinit Gupta
Modified:	2018-07-20 02:46 EDT
CC List:	3 users

See Also:	Gerrit Change Git Commit

Description Vinit Gupta

2018-03-12 04:11:58 EDT

Comment 1 Vinit Gupta

2018-03-12 04:17:00 EDT

My local repository already has more than 10,000 objects. I am seeing performance issues while using the git.add() command to add one more file to the index. Below is the JGit code snippet which I am using to interface my Java program with Git:

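import java.io.File;
import org.eclipse.jgit.api.AddCommand;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.storage.file.FileRepositoryBuilder;

String absoluteLocalGitPath = "c:\\localGitRepo\\.git";
FileRepositoryBuilder repositoryBuilder = new FileRepositoryBuilder();
repositoryBuilder.setMustExist(true);
repositoryBuilder.setGitDir(new File(absoluteLocalGitPath));
Repository repository = repositoryBuilder.build();
Git git = new Git(repository);

// Stage a single file; the pattern is relative to the working tree root.
AddCommand addCommand = git.add();
addCommand.addFilepattern("folder1/obj10001.obj");
addCommand.call();
Here, the path passed as the file pattern is relative to c:\localGitRepo.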

I did some profiling and found that the following methods take most of the time when using the add API:

  org.eclipse.jgit.internal.storage.file.ObjectDirectoryInserter.int(int,long,inputStream)
  org.eclipse.jgit.treewalk.WorkingTreeIterator.getEntryContentLength()
  org.eclipse.jgit.treewalk.TreeWalk.enterSubtree()

Comment 2 Christian Halstrick

2018-03-12 09:27:53 EDT

The add operation is the most performance intensive operation we have in a typical workflow. The file content is read and added to the git object database. I guess that you mean that

  org.eclipse.jgit.internal.storage.file.ObjectDirectoryInserter.insert(int,long,inputStream)

consumes most of the time (you wrote it's ...ObjectDirectoryInserter.int(... which doesn't exist). This means that adding the file content to the git database is taking the time.

How much time does it take to add a file and how big is the file?
How long does a native git add operation take?

Tip: you don't have to deal with FileRepositoryBuilder. Just do

  Git git = Git.open(new File(absoluteLocalGitPath));

Comment 3 Vinit Gupta

2018-03-12 12:32:26 EDT

In my case the time taken is directly proportional to the number of objects in the local repository. I suspect that even after providing the exact file name in the file pattern (the file I want to add to the index), JGit looks at all files that were modified since the last commit. My file size is less than 50 MB. This add call, after adding 10000 objects, takes approximately 3 minutes, but native git takes seconds to complete this operation. Am I making a mistake in how I pass the file name in the file pattern?

Comment 4 Christian Halstrick

2018-03-16 07:26:24 EDT

I think Thomas found the real reason for this in bug #388582. It seems that although you specified a specific path, JGit will internally still visit all the other files (not only those changed since last commit). Those other files are not added in the end, but visiting them and checking whether they are modified or not is costly. I think we have to fix our FileTreeIterator.
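
The cost described here can be illustrated without JGit. The sketch below is not JGit code and does not use FileTreeIterator; the class and method names are hypothetical, and it uses only java.nio.file. It counts how many directory entries are touched when looking for one relative path, once with a walk that visits everything and once with a walk that only descends into directories leading to the target.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration: entries touched while looking for one path. */
final class PathWalkSketch {

    /** No pruning: every entry under dir is visited, wanted or not. */
    static int visitEverything(Path dir) {
        int visited = 0;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                visited++;
                if (Files.isDirectory(child)) {
                    visited += visitEverything(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return visited;
    }

    /** Pruning: descend only into directories that are a prefix of target. */
    static int visitPruned(Path root, Path dir, Path target) {
        int visited = 0;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                visited++;
                boolean leadsToTarget = target.startsWith(root.relativize(child));
                if (leadsToTarget && Files.isDirectory(child)) {
                    visited += visitPruned(root, child, target);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return visited;
    }

    /** Builds 3 folders with 4 files each; returns {unpruned, pruned} counts. */
    static int[] demo() {
        try {
            // Temporary files are left behind; this is only a sketch.
            Path root = Files.createTempDirectory("walk-sketch");
            for (int d = 0; d < 3; d++) {
                Path folder = Files.createDirectory(root.resolve("folder" + d));
                for (int f = 0; f < 4; f++) {
                    Files.createFile(folder.resolve("obj" + f + ".obj"));
                }
            }
            Path target = Paths.get("folder1", "obj2.obj");
            return new int[] { visitEverything(root), visitPruned(root, root, target) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println("visited without pruning: " + counts[0]);
        System.out.println("visited with pruning:    " + counts[1]);
    }
}
```

Note that even the pruned walk still reads the full listing of each directory on the way to the target; it only avoids descending into unrelated directories.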

Comment 5 Thomas Wolf

2018-03-16 12:12:16 EDT

(In reply to Christian Halstrick from comment #4)
> the real reason for this in bug #388582.
> ...
> I think we have to fix our FileTreeIterator.

Yes, this would also affect adding. Any kind of filter comes too late.

But I fear this cannot be solved in isolation in FileTreeIterator. The iterator will have to know about an associated DirCacheIterator, and will have to know about filters. It rather looks like a redesign of this whole Iterator-TreeWalk-Filter combo is needed. But maybe someone more familiar with this code can see a better way.

As a stop-gap measure it might perhaps already help if one could delay calling directory.listFiles() and FS.getAttributes() until it's actually needed (which is probably never for files and directories filtered out in TreeWalk via gitignore or due to a PathFilter). Maybe only delaying the FS.getAttributes() would already help.
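
The idea of deferring attribute reads can be sketched in plain Java. This is not JGit's FS.getAttributes() and not how its iterators are built; the Entry class and its methods are hypothetical. Attributes are read on first access only, so entries rejected by a name-based filter never trigger a read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of lazily loaded file attributes. */
final class LazyAttributesSketch {

    /** A directory entry whose attributes are read on first use. */
    static final class Entry {
        final Path path;
        private BasicFileAttributes attributes; // stays null until requested

        Entry(Path path) {
            this.path = path;
        }

        boolean attributesLoaded() {
            return attributes != null;
        }

        BasicFileAttributes attributes() {
            if (attributes == null) {
                try {
                    attributes = Files.readAttributes(path, BasicFileAttributes.class);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return attributes;
        }
    }

    /** Lists a directory by name only; no attributes are read here. */
    static List<Entry> list(Path dir) {
        List<Entry> entries = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                entries.add(new Entry(child));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Creates 5 files, reads attributes of one; returns {listed, loaded}. */
    static int[] demo() {
        try {
            Path dir = Files.createTempDirectory("lazy-sketch");
            for (int i = 0; i < 5; i++) {
                Files.createFile(dir.resolve("obj" + i + ".obj"));
            }
            List<Entry> entries = list(dir);
            for (Entry entry : entries) {
                // The name filter decides before any attribute is needed.
                if (entry.path.getFileName().toString().equals("obj3.obj")) {
                    entry.attributes().size();
                }
            }
            int loaded = 0;
            for (Entry entry : entries) {
                if (entry.attributesLoaded()) {
                    loaded++;
                }
            }
            return new int[] { entries.size(), loaded };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch only works because the filter needs nothing but the name. As soon as a caller needs to know whether an entry is a directory, the attributes have to be read anyway.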

Comment 6 Thomas Wolf

2018-03-16 18:34:33 EDT

(In reply to Thomas Wolf from comment #5)
> Maybe only delaying the FS.getAttributes() would already help.

Gave this a try, but the code is a bit ugly because these iterators assume the real mode is known for each file up front. Getting around that assumption isn't all that easy...

On my Mac, I don't see any significant performance improvements, though. So maybe this is a red herring with respect to Vinit's problem. I'll have to give this a try on Windows, but I'll first have to find a machine. I only have Mac and Linux at my fingertips.

Comment 7 Eclipse Genie

2018-03-24 05:14:13 EDT

New Gerrit change created: https://git.eclipse.org/r/120118

Comment 8 Thomas Wolf

2018-03-24 05:23:53 EDT

(In reply to Eclipse Genie from comment #7)
> New Gerrit change created: https://git.eclipse.org/r/120118

Delaying getting the attributes doesn't help. It is possible to do so, but we need to know at least whether a file is a directory in order to sort the children correctly. isDirectory() internally reads the attributes, too... so even if we delay calling FS.getAttributes(), we'll still read the attributes under the hood, and we'll be no better off.

The above change tries to take advantage of the fact that Windows stores the file attributes in the directory. In my tests, it did bring a significant speed-up on Windows, but it doesn't solve the basic problem that the FileTreeIterator gets the full directory listing even if most files will be filtered out later again.
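
One way to obtain attributes together with the directory listing in plain Java is Files.walkFileTree with a depth limit of 1, which hands each child to visitFile along with its BasicFileAttributes. The sketch below is an illustration under that assumption and is not copied from the Gerrit change; the class and method names are hypothetical. Whether the attributes really come from the directory listing without a further file system call depends on the platform and the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.EnumSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration: list children together with their attributes. */
final class ListingWithAttributesSketch {

    /** Maps each direct child name to whether it is a directory. */
    static Map<String, Boolean> listChildren(Path dir) {
        Map<String, Boolean> isDirectoryByName = new TreeMap<>();
        try {
            // With maxDepth 1, directories at depth 1 are passed to visitFile too.
            Files.walkFileTree(dir, EnumSet.noneOf(FileVisitOption.class), 1,
                    new SimpleFileVisitor<Path>() {
                        @Override
                        public FileVisitResult visitFile(Path child, BasicFileAttributes attrs) {
                            // attrs arrives with the entry; no separate isDirectory() call.
                            isDirectoryByName.put(child.getFileName().toString(), attrs.isDirectory());
                            return FileVisitResult.CONTINUE;
                        }
                    });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return isDirectoryByName;
    }

    /** Creates a file, a subdirectory and a nested file, then lists the top level. */
    static Map<String, Boolean> demo() {
        try {
            Path dir = Files.createTempDirectory("listing-sketch");
            Files.createFile(dir.resolve("a.txt"));
            Path sub = Files.createDirectory(dir.resolve("sub"));
            Files.createFile(sub.resolve("nested.txt"));
            return listChildren(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The directory flag is available for sorting the children right away, but the full listing is still read, which is the part this approach does not address.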

Would be good if people working on Windows tested this change to see if it does bring a speed-up on Windows also in real-life settings.

Comment 9 Eclipse Genie

2018-03-25 08:34:35 EDT

Gerrit change https://git.eclipse.org/r/120118 was merged to [master].
Commit: http://git.eclipse.org/c/jgit/jgit.git/commit/?id=4bfc6c2ae9ec582575b05f4e63ee62212bb284a4

Comment 10 Thomas Wolf

2018-03-25 10:53:46 EDT

I'll provisionally close this as "fixed".

@Vinit: the fix is available in the EGit nightly update site at http://download.eclipse.org/egit/updates-nightly/ . Please update EGit and JGit from there, and check if it's any better now. I'd be interested in the timings now that https://git.eclipse.org/r/120118 is included.

If it's still not satisfactory, feel free to re-open this bug report, and please provide more info. Are all these 10000 files in the same directory? What's the directory layout? What ignored directories are there, and what's their layout?

Comment 11 Vinit Gupta

2018-07-19 19:51:10 EDT

@Thomas I gave this a try and found that it shows some improvement in a Windows environment, but it doesn't show any improvement on Linux. Our directory structure is dynamic, and files can be spread across multiple directories at any level. Can we get a general solution that works on all platforms?

Comment 12 Thomas Wolf

2018-07-20 02:46:57 EDT

(In reply to Vinit Gupta from comment #11)
> @Thomas I gave this a try and found that it shows some improvement in a Windows
> environment, but it doesn't show any improvement on Linux.

That's expected; it's exactly what https://git.eclipse.org/r/#/c/120118/ does.

> Our directory
> structure is dynamic, and files can be spread across multiple directories at any
> level. Can we get a general solution that works on all platforms?

Can you give more details? A concrete example with timings on Linux and Windows?

In JGit there is the stand-alone JUnit test program FileTreeIteratorPerformanceTest.java. It is based on the example in your comment 1 above, and prints timings for repeatedly adding single files. Run it as a JUnit test on Linux and Windows a few times, and attach the outputs of the third run on each system to this bug report.

For comparison, please also do the same test (adding single files in one directory to an empty repository, 500 times) using command-line git. Time each add command.

If the JGit times and command line git times are similar I doubt we could improve this much. It's unlikely to beat compiled C code with Java. Though for this particular test JGit might have an advantage since it does it all in one process.

But more interesting than the absolute times would be the trend. With JGit I see that adding a file still takes slightly longer the more files there are, which is to be expected as long as it's linear with respect to the number of files in the directory. Intermittently some adds take much longer, which may have to do with window cache misses. But averaged out it seems to be about O(C * N) for some constant C. I wonder if command-line git shows similar behavior.

As long as we have roughly linear timings, we could only hope to make that constant C smaller, and I'm not sure there's much leeway there. If the timings are clearly non-linear, then we may have a chance to really improve this further.
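
Whether a series of timings is roughly linear can be checked with a small least-squares fit of t = C * N. The sketch below is a hypothetical helper, not part of JGit or of FileTreeIteratorPerformanceTest, and the numbers in main are made up purely to show the call; they are not measurements.

```java
/** Hypothetical helper: fits t = C * n and reports how far the data strays. */
final class LinearTrendSketch {

    /** Least-squares slope C for the model t = C * n (line through the origin). */
    static double fitSlope(double[] n, double[] t) {
        double sumNT = 0;
        double sumNN = 0;
        for (int i = 0; i < n.length; i++) {
            sumNT += n[i] * t[i];
            sumNN += n[i] * n[i];
        }
        return sumNT / sumNN;
    }

    /** Largest relative gap between a measured time and the fitted line. */
    static double maxRelativeDeviation(double[] n, double[] t) {
        double c = fitSlope(n, t);
        double worst = 0;
        for (int i = 0; i < n.length; i++) {
            double predicted = c * n[i];
            worst = Math.max(worst, Math.abs(t[i] - predicted) / predicted);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Made-up sample: number of files in the directory, time per add in ms.
        double[] files = { 100, 200, 300, 400, 500 };
        double[] millis = { 1.1, 1.9, 3.2, 3.9, 5.0 };
        System.out.println("C = " + fitSlope(files, millis));
        System.out.println("max relative deviation = " + maxRelativeDeviation(files, millis));
    }
}
```

A small deviation suggests the timings fit the linear model and only the constant C is left to tune; a large one, or one that grows with N, points at non-linear behavior.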

Also interesting might be to time adding 500 files in different directories, both in Java and in command-line git. Both flat (directory<N>/newfile), and nested (directory<N/20>/newfile<N%20>) or even (directory<N/20>/directory<N%20>/newfile).
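
The three layouts can be generated with a few lines of Java, which may help keep the JGit run and the command-line run on identical paths. The class and method names below are hypothetical; only the path patterns come from the comment above, read with integer division and remainder.

```java
/** Hypothetical helper: relative paths for the three suggested test layouts. */
final class AddLayoutSketch {

    /** directory<N>/newfile */
    static String flat(int n) {
        return "directory" + n + "/newfile";
    }

    /** directory<N/20>/newfile<N%20> */
    static String nested(int n) {
        return "directory" + (n / 20) + "/newfile" + (n % 20);
    }

    /** directory<N/20>/directory<N%20>/newfile */
    static String deep(int n) {
        return "directory" + (n / 20) + "/directory" + (n % 20) + "/newfile";
    }

    public static void main(String[] args) {
        // Print the paths for the 500 adds of each layout.
        for (int n = 0; n < 500; n++) {
            System.out.println(flat(n) + "\t" + nested(n) + "\t" + deep(n));
        }
    }
}
```

With 500 files, the nested and deep variants spread the files over 25 top-level directories of 20 entries each.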