Community
Participate
Working Groups
N20050502-0010: Searching in text files has become about 4 times slower on my machine than using 3.0: Setup is: Workingset of 6 plugin projects with a total of 980 source files. Search string is "\n". CPU Time: 3.1: 20400 ms, 3.0: 5150 ms CPU Sampling shows that almost all time is spent inside org.eclipse.core.internal.watson.ElementTreeIterator#doIteration Attaching trace...
Created attachment 20674 [details] CPU Sampling Trace
It seems like the performance degradation is due to extensive file I/O caused by the resource tree traversal.
I have doubts about your findings: 1) Element tree iteration doesn't do *any* file I/O. It is traversing a tree in memory. 2) This code has hardly changed since before 3.0. The last major change in the iterator code was in January 2003 when we added support for the resource proxy visitor. 3) Tree iteration is essentially a recursive method that just calls the visitor's "visit" method on each element. From the trace, I see the amount of time in all the visit methods adds up to nearly all the time of the iteration.
It seems that a regex matcher asks the char sequence whenever it get reset for the actual length of the char sequence. I think for the char sequence implemented in search we have to read the file twice as soon as its size in char is greater than > 3 * 2048 bytes. Moving back to search to investigate this further.
I did some additional measurement and the char sequence algorithm isn't a performance problem at all. Here are the numbers (I rebooted the machine for every test I did): Test case- searching in 10715 files for string "ltk", which produced 1277 matches: Build first search second search Nightly build 20050505 3:52 min 25 sec Increased buffer to 256 MB. This 3:53 min 25 sec gives 768 MB size for files, into which most files should fit Harcoded the length of the char 3:35 min 25 sec sequence to 14680 characters and always returned 'x' from chatAt. The file was still opened and closed. This was actually a big surprise. This means that the time isn't spend at all reading the files from disk. Note that due to the fact that I always returned 'x' no matches where found. As above, plus hardcoded the 3:10 min 30 sec encoding to Cp1252 This shows that we need 25 seconds to get the encoding for 10715 files As above, but didn't open the file 25 sec 25 sec at all. This means we spent almost all the time in opening the file not in reading it or doing the regex matching. NO chatAt or length tweaking, but 3:06 25 sec directly using io.File to access the file not IFile.getContent This means we spent ~50 sec for the overhead of IFile.getContent. I then executed a refesh from local on the workspace and this took 2:50 min. Refresh from local doesn't read any file content however it reads the file stat to see if the file is out of sync. Executed a native Windows search on all *.java files on the workspace. I cancel the search after 5 minutes.
The I/O measurements don't surprise me, I have seen this pattern before. Simplying opening a file in Java seems to load some low level disk cache with the first part of the file. Any reading from the stream on the first 8K or so it essentially "free" because it's already buffered at the OS level or below. I'm a bit surprised by your refresh local measurements. A refresh local on my workspace (about 25,000 files), assuming everything is in sync, takes about 5 seconds for me. Is the "2:50 min" measuring multiple repetitions of refreshLocal?
After getting these result I redid the performance measurements (I should have done this in the first place ;-)) and surprisingly there isn't any difference between 3.0.2 and 3.1. Here are the numbers: Setup: workspace with the following projects in source (from 3.0) autobuild off. org.eclipse.jdt.core org.eclipse.jdt.ui org.eclipse.jdt.ui.tests.refactoring org.eclipse.ui.ide org.eclipse.ui.views org.eclipse.ui.workbench This gives you total 8583 files. Without test.refactoring (which are many small files) there are 3671 files. Searching for _x_ (to avoid to many matches) with the standard settings gives the following picture: 3.0.2: 8583 files: 2:21 3671 files: 1:50 3.1: 8583 files: 2:08 3671 files: 1:50 So there isn't any slow down in search between 3.0.2 and 3.1. Tobias, please verify these numbers again and if they are correct close the bug and open a bug against core for the findings of the encoding and the native file usage. May be they are of any help.
John, regading the refresh local. The numbers may not be comparable since I did it on the whole workspace whereas the search was only for *.java files. I will redo the refresh when I boot the next time.
For a while I have it on my To Do list to gather some numbers steps sot that I could file a bug, since I also have the feeling that file/text search got significantly slower. Note that I use flat layout.
Dani, as the numbers show almost all time is spent in opening the files. Since this is the same between 3.0 & 3.1 I am pretty sure that the performance sin't really different. Please note that for some time we used NIO which might has given us some real speed improvements but we can't ship on it due to the known problems with NIO. Dani, do you have a concrete use case Tobias could look at to verify my findings.
I went over this during the test pass and tried various setups and compared them with 3.0.2 without finding a regression. When I encountered the feeling for a slowdown it was mostly very significant as it would wait/hang inside a special file or for a special file.
setting to remind.
No further work planned on this one.