93638 – Verify performance of text search in 3.1 against 3.0 [performance]

Bug 93638 - Verify performance of text search in 3.1 against 3.0 [performance]

Summary: Verify performance of text search in 3.1 against 3.0 [performance]

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Search (show other bugs)
Version:	3.1
Hardware:	PC Windows XP

Importance:	P3 normal (vote)
Target Milestone:	---
Assignee:	Tobias Widmer
QA Contact:

URL:
Whiteboard:
Keywords:	performance

Depends on:
Blocks:

Reported:	2005-05-04 09:04 EDT by Tobias Widmer
Modified:	2009-08-19 05:37 EDT (History)
CC List:	3 users (show)

See Also:

Attachments
CPU Sampling Trace (5.64 KB, application/x-zip-compressed) 2005-05-04 09:05 EDT, Tobias Widmer	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tobias Widmer

2005-05-04 09:04:44 EDT

N20050502-0010: Searching in text files has become about 4 times slower on my 
machine than using 3.0:

Setup is: Workingset of 6 plugin projects with a total of 980 source files. 
Search string is "\n".

CPU Time: 3.1: 20400 ms, 3.0: 5150 ms

CPU Sampling shows that almost all time is spent inside 
org.eclipse.core.internal.watson.ElementTreeIterator#doIteration

Attaching trace...

Comment 1 Tobias Widmer

2005-05-04 09:05:39 EDT

Created attachment 20674 [details]
CPU Sampling Trace

Comment 2 Tobias Widmer

2005-05-04 10:14:06 EDT

It seems like the performance degradation is due to extensive file I/O caused 
by the resource tree traversal.

Comment 3 John Arthorne

2005-05-04 10:54:44 EDT

I have doubts about your findings:

1) Element tree iteration doesn't do *any* file I/O. It is traversing a tree in
memory.
2) This code has hardly changed since before 3.0. The last major change in the
iterator code was in January 2003 when we added support for the resource proxy
visitor.
3) Tree iteration is essentially a recursive method that just calls the
visitor's "visit" method on each element.  From the trace, I see the amount of
time in all the visit methods adds up to nearly all the time of the iteration.

Comment 4 Dirk Baeumer

2005-05-04 17:32:22 EDT

It seems that a regex matcher asks the char sequence whenever it get reset for
the actual length of the char sequence. I think for the char sequence
implemented in search we have to read the file twice as soon as its size in char
is greater than > 3 * 2048 bytes.

Moving back to search to investigate this further.

Comment 5 Dirk Baeumer

2005-05-05 10:56:01 EDT

I did some additional measurement and the char sequence algorithm isn't a
performance problem at all. Here are the numbers (I rebooted the machine for
every test I did):

Test case- searching in 10715 files for string "ltk", which produced 1277 matches:

Build                                   first search              second search

Nightly build 20050505                  3:52 min                  25 sec

Increased buffer to 256 MB. This        3:53 min                  25 sec
gives 768 MB size for files, into
which most files should fit

Harcoded the length of the char         3:35 min                  25 sec
sequence to 14680 characters and
always returned 'x' from chatAt.
The file was still opened and
closed.

This was actually a big surprise. This means that the time isn't spend at all
reading the files from disk. Note that due to the fact that I always returned
'x' no matches where found.

As above, plus hardcoded the            3:10 min                  30 sec
encoding to Cp1252

This shows that we need 25 seconds to get the encoding for 10715 files

As above, but didn't open the file      25 sec                    25 sec
at all. 

This means we spent almost all the time in opening the file not in reading it or
doing the regex matching. 

NO chatAt or length tweaking, but       3:06                      25 sec
directly using io.File to access
the file not IFile.getContent

This means we spent ~50 sec for the overhead of IFile.getContent.

I then executed a refesh from local on the workspace and this took 2:50 min.
Refresh from local doesn't read any file content however it reads the file stat
to see if the file is out of sync.

Executed a native Windows search on all *.java files on the workspace. I cancel
the search after 5 minutes.

Comment 6 John Arthorne

2005-05-05 12:09:45 EDT

The I/O measurements don't surprise me, I have seen this pattern before. 
Simplying opening a file in Java seems to load some low level disk cache with
the first part of the file.  Any reading from the stream on the first 8K or so
it essentially "free" because it's already buffered at the OS level or below.

I'm a bit surprised by your refresh local measurements.  A refresh local on my
workspace (about 25,000 files), assuming everything is in sync, takes about 5
seconds for me.  Is the "2:50 min" measuring multiple repetitions of refreshLocal?

Comment 7 Dirk Baeumer

2005-05-05 12:38:59 EDT

After getting these result I redid the performance measurements (I should have
done this in the first place ;-)) and surprisingly there isn't any difference
between 3.0.2 and 3.1. Here are the numbers:

Setup: workspace with the following projects in source (from 3.0) autobuild off.

org.eclipse.jdt.core
org.eclipse.jdt.ui
org.eclipse.jdt.ui.tests.refactoring
org.eclipse.ui.ide
org.eclipse.ui.views
org.eclipse.ui.workbench

This gives you total 8583 files. Without test.refactoring (which are many small
files) there are 3671 files. Searching for _x_ (to avoid to many matches) with
the standard settings gives the following picture:

3.0.2:

8583 files:                   2:21
3671 files:                   1:50

3.1:

8583 files:                   2:08
3671 files:                   1:50

So there isn't any slow down in search between 3.0.2 and 3.1.

Tobias, please verify these numbers again and if they are correct close the bug
and open a bug against core for the findings of the encoding and the native file
usage. May be they are of any help.

Comment 8 Dirk Baeumer

2005-05-05 12:40:24 EDT

John, regading the refresh local. The numbers may not be comparable since I did
it on the whole workspace whereas the search was only for *.java files. I will
redo the refresh when I boot the next time.

Comment 9 Dani Megert

2005-05-08 15:44:36 EDT

For a while I have it on my To Do list to gather some numbers steps sot that I
could file a bug, since I also have the feeling that file/text search got
significantly slower. Note that I use flat layout.

Comment 10 Dirk Baeumer

2005-05-08 17:46:06 EDT

Dani, as the numbers show almost all time is spent in opening the files. Since
this is the same between 3.0 & 3.1 I am pretty sure that the performance sin't
really different.

Please note that for some time we used NIO which might has given us some real
speed improvements but we can't ship on it due to the known problems with NIO.

Dani, do you have a concrete use case Tobias could look at to verify my findings.

Comment 11 Dani Megert

2005-05-10 11:15:16 EDT

I went over this during the test pass and tried various setups and compared them
with 3.0.2 without finding a regression.

When I encountered the feeling for a slowdown it was mostly very significant as
it would wait/hang inside a special file or for a special file.

Comment 12 Martin Aeschlimann

2005-06-01 09:13:36 EDT

setting to remind.

Comment 13 Dani Megert

2009-08-19 05:37:11 EDT

No further work planned on this one.