Bug 574845 - Indexing performance affected by metaindex changes
Summary: Indexing performance affected by metaindex changes
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 4.21   Edit
Hardware: PC All
: P3 normal (vote)
Target Milestone: 4.21 M2   Edit
Assignee: Andrey Loskutov CLA
QA Contact:
URL:
Whiteboard:
Keywords: performance, regression
Depends on:
Blocks: 570078
  Show dependency tree
 
Reported: 2021-07-14 09:16 EDT by Gayan Perera CLA
Modified: 2021-08-26 05:40 EDT (History)
2 users (show)

See Also:


Attachments
slow index samples (28.03 KB, application/octet-stream)
2021-07-14 09:18 EDT, Gayan Perera CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gayan Perera CLA 2021-07-14 09:16:41 EDT
In eclipse build  I20210713-1800 the indexing of 5000 files is extremly slow compared to previous release. This was introduce after Bug570078 and related deadlock issue fixes.
Comment 1 Gayan Perera CLA 2021-07-14 09:17:45 EDT
@Andrey we might need to disable meta index again until we fix this. WDYT ?
Comment 2 Gayan Perera CLA 2021-07-14 09:18:05 EDT
Created attachment 286788 [details]
slow index samples
Comment 3 Andrey Loskutov CLA 2021-07-14 10:00:56 EDT
(In reply to Gayan Perera from comment #0)
> In eclipse build  I20210713-1800 the indexing of 5000 files is extremly slow

Could you quantify that? 2x slower, 20x?

> compared to previous release.

Are you comparing 4.21 with 4.20 or 4.21 master with the 4.21 build before https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/181878 ?

(In reply to Gayan Perera from comment #1)
> @Andrey we might need to disable meta index again until we fix this. WDYT ?

I would say, please check what causes this first. May be we create meta index properly for the first time :-)

Note, the last patch not only fixed deadlock conditions but *enabled* meta index, so to compare performance of index creation with the change before one need to run Eclipse with -Dorg.eclipse.jdt.disableMetaIndex=false.
Comment 4 Andrey Loskutov CLA 2021-07-15 08:47:22 EDT
Gayan, were you able to profile/analyze the performance issue? Any insights?

I've quickly checked 4.20 release against latest master and it looks that we have indexing performance drop in the area of ~30% for my specific workspace on my specific HW, for the full index triggered manually via Preferences -> Java -> Rebuild Index. 

On master that takes ~3 minutes, on 4.20 ~2.5 minutes. I haven't compared different 4.21 builds with each other and enabled / disabled meta index yet.
Comment 5 Gayan Perera CLA 2021-07-15 13:54:49 EDT
I did a quick sampling and saw that a lot of indexing time is spent on querying the qualifiers from the index back before merging them into meta index. I have to see why that is happening. May be the changes we did to avoid dead locks cause this. I will check more on this and provide a patch.

One thing I notice was when the patch was enabled my eclipse was laggy specially after doing code changes, which means delta updates resulting in index updates. So I will look into this and see what going on as well.

I wonder if there could be a way to count index reads and writes with some specific tests targeting to catch these kind of issues leaking in.
Comment 6 Gayan Perera CLA 2021-07-15 14:58:02 EDT
Checked on the diff and its mostly like 6% impact I see with new changes we added to fix the deadlock. I did indexing on STS4 project modules.

The build impact is higher on earlier than now. With the dead lock fix a clean build of STS4 workspace is 33% faster.

So I think the problem I saw is something else or may be I might need to more larger workspace. But I will see if we can improve the qualification query. Seems like we tends to reload the qualification from disk to memory just after dumping the memory index into disk. So may be reading while everything is on memory index might improve it.
Comment 7 Andrey Loskutov CLA 2021-07-15 15:15:42 EDT
Sounds reasonable. 
Looking on your snapshot, we could probably supply index.getMetaIndexQualifications() wrapped as lambda to updateMetaIndex() so in case it is not needed at all in updateMetaIndex() we do not need to calculate it?
Comment 8 Gayan Perera CLA 2021-07-15 15:56:55 EDT
Yes, but how do we decide whether it needs to update or not ? I was more thinking only update meta index if the index has only changed at saveIndex method. Because today regardless index has changes or not we update meta index. WDYT ?
Comment 9 Gayan Perera CLA 2021-07-17 05:22:02 EDT
@Andrey, I did some changes I previously mentioned, but seems like I was wrong, because memory index get merged as soon as it reaches its limit before starting queries. And when queries it always use both memory and disk if the merging doesn't happen. So I think we use the memory index as a buffer to avoid too much disk writes.

So moving the updateMetaIndex before save doesn't change the metaIndex time. And most of the requests into saveIndex while indexing is when there are changes.

so I was thinking about what you suggested, moving the query until we need, do you see lot of cases where we will exit the updateMetaIndex before merging qualifications into meta index ?
Comment 10 Andrey Loskutov CLA 2021-07-17 10:17:52 EDT
(In reply to Gayan Perera from comment #9)
> so I was thinking about what you suggested, moving the query until we need,
> do you see lot of cases where we will exit the updateMetaIndex before
> merging qualifications into meta index ?

What I observe for a project with 20000 source files is that IndexManager.scheduleDocumentIndexing(SearchDocument, IPath, IndexLocation, SearchParticipant) is called 20000 times for every source file indexed and updates the index for that project 20000 times with increasing number of elements every time.

Means, we start with project index at some state (let say 5 files), remove everything from project in meta-index and ask to add entries for 6 files. Next request, remove everything from project in meta-index and ask to add entries for 7 files. Next request ... and so on, 20000 times for the project, with ever increasing work amount.

Either we can realize *when* the project source indexing is complete and update meta-index once or  we could manage a single threaded *filtering* queue where we put index files to be re-indexed by meta index, and some delay to work on. So if we get 20000 times request to update the index, the things would look like:

queue.put(index_for_project_X)
queue.put(index_for_project_X)
...
# 20000 times
queue.put(index_for_project_X)

The single worker job that maintains the queue peeks the first element from it and starts update metaindex. If the same element is added at same time we work on that index, job can cancel and re-schedule again. If the queue has same element already in the queue, queue.put() simply does nothing.

This way we can reduce 20000 requests to update the metaindex for the project X to may be ~10 requests in worst case, depending how long the scheduling delay can be selected.

Would you look at that?

Independently from that above, I will push a small cleanup patch now so it is easier to see where the index is working and where not.
Comment 11 Eclipse Genie CLA 2021-07-17 10:18:27 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/183140
Comment 13 Gayan Perera CLA 2021-07-17 12:52:13 EDT
(In reply to Andrey Loskutov from comment #10)
> 
> What I observe for a project with 20000 source files is that
> IndexManager.scheduleDocumentIndexing(SearchDocument, IPath, IndexLocation,
> SearchParticipant) is called 20000 times for every source file indexed and
> updates the index for that project 20000 times with increasing number of
> elements every time.
> 
> Means, we start with project index at some state (let say 5 files), remove
> everything from project in meta-index and ask to add entries for 6 files.
> Next request, remove everything from project in meta-index and ask to add
> entries for 7 files. Next request ... and so on, 20000 times for the
> project, with ever increasing work amount.
> 
> Either we can realize *when* the project source indexing is complete and
> update meta-index once or  we could manage a single threaded *filtering*
> queue where we put index files to be re-indexed by meta index, and some
> delay to work on. So if we get 20000 times request to update the index, the
> things would look like:
> 
> queue.put(index_for_project_X)
> queue.put(index_for_project_X)
> ...
> # 20000 times
> queue.put(index_for_project_X)
> 
> The single worker job that maintains the queue peeks the first element from
> it and starts update metaindex. If the same element is added at same time we
> work on that index, job can cancel and re-schedule again. If the queue has
> same element already in the queue, queue.put() simply does nothing.
> 
> This way we can reduce 20000 requests to update the metaindex for the
> project X to may be ~10 requests in worst case, depending how long the
> scheduling delay can be selected.
> 
> Would you look at that?

Yes i was looking into this in the morning. But didn't found the project source scenario. Yes i was also thinking along your suggestion, rather than trying to have a separate single threaded queue working on indexes with a delay, we could maintain a filtering queue to keep track of unique indexes to be index by meta-index and push a job to do the merge into the same indexer thread if there are no pending meta-index merge job. When the meta-index job's time comes, it will check if there are more jobs pending in indexer, if yes it will cancel other wise it will drain the meta index queue. This way we don't need to worry about race conditions with search jobs and meta-index thread.

> 
> Independently from that above, I will push a small cleanup patch now so it
> is easier to see where the index is working and where not.
Approved and merged since it was a really small change. Hope its ok.
Comment 14 Eclipse Genie CLA 2021-07-17 15:45:31 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/183142
Comment 15 Andrey Loskutov CLA 2021-07-17 15:46:15 EDT
(In reply to Eclipse Genie from comment #14)
> New Gerrit change created:
> https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/183142

Gayan, I haven't tested that at all, but I believe something like this could make sense. Could you check that?
Comment 16 Gayan Perera CLA 2021-07-17 15:59:18 EDT
I also sketched a similar implementation based on synchronized ordered Set. But i will check your implementation, i see that you have used the filtering by check and add. I did it by using a entry object to capture index, name and remove flag. Because when index is removed no point of trying query qualifications.
Comment 17 Andrey Loskutov CLA 2021-07-17 18:07:21 EDT
(In reply to Gayan Perera from comment #16)
> I also sketched a similar implementation based on synchronized ordered Set.
> But i will check your implementation, i see that you have used the filtering
> by check and add. I did it by using a entry object to capture index, name
> and remove flag. Because when index is removed no point of trying query
> qualifications.

Good point, I've updated the patch - on removal, index is immediately removed from meta index updates queue.
Comment 18 Andrey Loskutov CLA 2021-07-18 04:42:05 EDT
(In reply to Andrey Loskutov from comment #17)
> Good point, I've updated the patch - on removal, index is immediately
> removed from meta index updates queue.

I've now compared indexer performance before/after this patch.
I have following test setup: 

1) Java8 project that contains all Eclipse 3.8 jar files from plugins directory as libraries on classpath + JDK 8.
2) Java11 project that contains all Eclipse 4.20 jar files from plugins directory as libraries on classpath + JDK 11.
3) My "JavaProhectGenerator" project at branch https://github.com/iloveeclipse/java-project-generator/tree/Bug_574845 - after running "Main" class inside it will contain 10.000 java files. The project has no other libraries referenced and JDK 11 as the only dependency.

This setup on master needs ~5 minutes for a full index rebuild started from Preferences -> Java, and only ~1 minute after the patch. The most time isn't spent on indexing of first two projects that contain A LOT of libraries, but on the third one.

For the first two projects both master / patch need same time, around 25 seconds. The rest time (~4:30 on master and ~30 with patch) is spent on indexing the single project with 10.000 source files.

The problem with current code on master is described in comment 10 - the code has N! indexing effort per project where N is the number of source files in the project. That grows extremely fast with bigger projects. "Small" projects with average of ~500 files like typical SDK bundles I had in my workspace in comment 4 aren't affected that much (I've measured only 30% performance decrease), but the performance drops rapidly with project size increase, so here we have 500% performance decrease and that is not yet for a really big project (we have customers with ~100.000 files in a single project).
Comment 20 Gayan Perera CLA 2021-07-18 15:26:03 EDT
I think this should solve my performance issue i felt as well. Now i remember it happen twice and both of the time after pulling lot of changes in my projects from origin.

So it explains why eclipse cause CPU spikes when every pulling changes and refreshing the workspace after a clean build which changes the generated source files.
Comment 21 Andrey Loskutov CLA 2021-07-19 08:33:57 EDT
Gayan, could you please verify with the build https://download.eclipse.org/eclipse/downloads/drops4/I20210718-1800/ ?
Comment 22 Gayan Perera CLA 2021-07-19 12:48:20 EDT
Yes the fix is working as expected.
Comment 23 Jay Arthanareeswaran CLA 2021-08-26 05:40:31 EDT
(In reply to Gayan Perera from comment #22)
> Yes the fix is working as expected.

Thanks Gayan, marking the bug as verified.