415874 – Search scalability

Status:	RESOLVED FIXED

Alias:	None

Product:	Orion (Archived)
Classification:	ECD
Component:	Server
Version:	3.0
Hardware:	PC Windows 7

Importance:	P3 normal
Target Milestone:	4.0 RC1
Assignee:	libing wang
QA Contact:

URL:
Whiteboard:
Keywords:	noteworthy

Depends on:
Blocks:

Reported:	2013-08-26 09:52 EDT by John Arthorne
Modified:	2013-09-27 12:14 EDT
CC List:	3 users

See Also:

Description John Arthorne

2013-08-26 09:52:39 EDT

4.0 M1

We are seeing scalability problems with the search implementation on orionhub. It used to be extremely fast (< 1 second search), but we have seen a dramatic increase in search times. We are now seeing upwards of 30 seconds to perform a search.

The search implementation is Apache Lucene 3.5. Lucene is certainly capable of very fast searches on large amounts of data. There are a number of things to explore:

 - Is our Lucene/Solr configuration optimal, or is it using settings that are hurting search times?
 - We can split Lucene/Solr into a separate process, allowing a dedicated heap, and sharding across multiple instances if needed.
 - We have a single search index for the entire workspace. Each user is only searching over their own files, which is a fraction of the workspace. Each query is parameterized to only search that user's files. Is there a way to optimize the index or the search for this case (like setting the user as the primary search key)?

Comment 1 John Arthorne

2013-08-26 13:31:12 EDT

Note the current index on orionhub.org is 2.8 GB. I have read docs that recommend allocating a heap as big as your index, but currently we only allocate 800 MB for the entire Orion server. So increasing the heap may help here.

Another pointer:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Especially this part which suggests maybe we should use a filter rather than a query to restrict searches to only a single user's files: 

"Consider using filters. It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index."

Comment 2 John Arthorne

2013-08-26 14:00:56 EDT

I think we should try using a filter for the user, rather than making it part of the search query. What this does is first filter out all documents not matching that user, and then performs the main search against only that subset.

http://wiki.apache.org/solr/CommonQueryParameters#fq
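
A minimal Java sketch of the idea in this comment, assuming a toy in-memory index rather than Lucene or Solr: the set of one user's documents is computed once as a bit set, cached, and reused, so each later search only tests documents inside that subset. The class name, the Doc type and the filterBuilds counter are hypothetical and exist only for illustration; Solr's real filter cache works on its internal document ids and is not shown here.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory index illustrating a cached per-user bit set filter (not Lucene/Solr code). */
final class CachedUserFilterSketch {
    /** Hypothetical document: who owns it and what it contains. */
    static final class Doc {
        final String owner;
        final String text;
        Doc(String owner, String text) { this.owner = owner; this.text = text; }
    }

    private final List<Doc> docs;
    private final Map<String, BitSet> filterCache = new HashMap<>();
    int filterBuilds = 0; // how many times a filter had to be computed from scratch

    CachedUserFilterSketch(List<Doc> docs) { this.docs = docs; }

    /** Builds the owner's bit set on first use, then returns the cached copy. */
    BitSet userFilter(String owner) {
        return filterCache.computeIfAbsent(owner, o -> {
            filterBuilds++;
            BitSet bits = new BitSet(docs.size());
            for (int i = 0; i < docs.size(); i++) {
                if (docs.get(i).owner.equals(o)) bits.set(i);
            }
            return bits;
        });
    }

    /** Runs the main match against only the documents the filter allows. */
    int countHits(String owner, String term) {
        BitSet allowed = userFilter(owner);
        int hits = 0;
        for (int i = allowed.nextSetBit(0); i >= 0; i = allowed.nextSetBit(i + 1)) {
            if (docs.get(i).text.contains(term)) hits++;
        }
        return hits;
    }
}
```

In this model a second search by the same user reuses the cached bit set, which is the speed-up one would hope to see from a filter query.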

Comment 3 John Arthorne

2013-08-26 14:03:35 EDT

I have released the most obvious optimizations to the search schema:

 - Removed unused fields
 - Removed indexing from fields we do not search or sort against (fields that only need to appear in the result document we send back to the client, such as case-sensitive name, isDirectory, and file size)
 - Removed unused field types (likely not important but cleans up our schema)

http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=1528e04074276a41d117d8a7979ab73a1192ea30

There are other possible schema changes, such as the merge factor, but those are trade-offs that affect indexing speed. We should start with the optimizations that have no trade-off and see how far they get us.

Comment 4 libing wang

2013-09-16 15:09:03 EDT

(In reply to John Arthorne from comment #2)
> I think we should try using a filter for the user, rather than making it
> part of the search query. What this does is first filter out all documents
> not matching that user, and then performs the main search against only that
> subset.
> 
> http://wiki.apache.org/solr/CommonQueryParameters#fq

Just read the pointer. It makes sense to me.
Currently we pass UserName as the last part of the query string that goes to the engine.
Using it as an fq parameter instead can help the engine quickly pick out the documents owned by this user in a first pass.

The approach is also simple.
We just remove the last part of the query string (e.g. UserName:foo) and use it as the value of the "fq" parameter (e.g. query.setParameter("fq", "UserName:foo")).
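
A hedged Java sketch of that split, using only the JDK. It assumes the existing query string has the form "<main query> AND UserName:<user>"; that exact format, the class name and both method names are assumptions made for illustration, not Orion's actual code. With SolrJ, the same split would typically be expressed with SolrQuery.setQuery and SolrQuery.addFilterQuery.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Moves a trailing UserName clause out of the main query and into an fq parameter. */
final class FilterQueryParamsSketch {
    // Assumed separator between the main query and the user clause.
    private static final String USER_CLAUSE = " AND UserName:";

    /** Splits "<main query> AND UserName:<user>" into q and fq request parameters. */
    static Map<String, String> toParams(String queryString) {
        Map<String, String> params = new LinkedHashMap<>();
        int at = queryString.lastIndexOf(USER_CLAUSE);
        if (at < 0) {
            params.put("q", queryString); // no user clause found: leave the query untouched
            return params;
        }
        params.put("q", queryString.substring(0, at));
        params.put("fq", "UserName:" + queryString.substring(at + USER_CLAUSE.length()));
        return params;
    }

    /** Renders the parameters as a URL query string, as an HTTP Solr request would carry them. */
    static String toUrlQuery(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }
}
```

For example, toParams("sele* AND UserName:foo") yields q=sele* and fq=UserName:foo, so the user restriction can be cached independently of the search term.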

On OrionHub we have hundreds of users, which brings the total doc size to 200G. But a single user's subset should be much smaller.

I have a test script that generates a large number of docs for a single user.
To test this theory I will have to run my test script for different users and see how much "fq" improves the speed.

Comment 5 libing wang

2013-09-20 12:24:20 EDT

On my local server, I had a single user holding 20G of files.
Then I added 12 more users, each holding 30M of files.
I changed the server code to use a filter query instead of putting the user in the query string.
Then I cleared the index files and restarted the server.
After the index files were rebuilt, I did the following steps:
1. Log in as a user with 30M of files and do a search. It took about 7 seconds to get 500 hits.
2. With the same user, search on a different keyword. Again 7 seconds. Here I expected a speed-up, because the doc subset from step 1 should have been cached.
3. Switch to the user with 20G of files and do a search. Again 7 seconds, with a million hits.

I was hoping step 2 would take less time than step 3, but unfortunately they took the same time.

I even rolled the server code back to use the query string again, and the search still took about 7 seconds every time.

This tells me the filter query is not really taking effect as documented.
I think the bottleneck of our search is that it does not narrow down to the document subset of the logged-in user at all.
I have to install a monitoring tool to see why the filter is not cached.

Comment 6 libing wang

2013-09-23 17:51:02 EDT

I've downloaded the latest Solr 4.4 and installed it on my laptop.
Then I followed the instructions to run it as an HTTP server with Jetty.
I was able to run the admin UI for my local Solr as a standalone server.
I tried overwriting the "almost empty" index folder with the 3G of index files I created with the Orion server.

In the admin UI, amazingly, I could run queries and filter queries against the test index files, and they are blazing fast.

There are some considerations we should think about now:

1. Orion uses embeddedSolrServer, whereas my standalone Solr is an httpSolrServer.
2. Orion uses version 3.5, whereas my standalone Solr is 4.4.
3. embeddedSolrServer is not recommended for production environments; a lot of articles say so. We should at least try other types of server.

Comment 7 libing wang

2013-09-24 17:38:12 EDT

After comparing the queries sent to the embeddedSolrServer and the httpSolrServer, both version 3.5, I made the following changes for the embeddedSolrServer:
1. Use a filter query for both UserName and Location.
2. Remove the leading "*" from the wildcard search.

This significantly speeds up the search, especially after the first search once the user logs in.

If I add the leading "*" back to the query, both the HTTP and embedded servers are much slower.

However, if I use the version 4.4 HTTP server, the "*foo*" search is much, much faster.

So here are the choices:
1. Find a better way to index and query the leading "*" search in 3.5.
2. Upgrade to Solr 4.4.

Comment 8 John Arthorne

2013-09-25 09:00:30 EDT

It is probably the leading wildcard in particular that is hurting us. Try increasing the amount of heap space for an external solr server and see if that helps. For example 2G of heap instead of whatever default you are using.

Comment 9 libing wang

2013-09-25 10:23:47 EDT

(In reply to John Arthorne from comment #8)
> It is probably the leading wildcard in particular that is hurting us. Try
> increasing the amount of heap space for an external solr server and see if
> that helps. For example 2G of heap instead of whatever default you are using.

Yes, indeed.
I increased the heap size to 2G for both the 3.5 and 4.4 external Solr servers.
3.5 got a little faster on query time (e.g. a 5-second search on *pub*).
4.4 is always very fast for a user with a small amount of data, whether or not the heap size is increased. 4.4 takes longer for the user with huge data (20G), but that makes sense to me. In reality we can't have a user with 20G of data.

I googled some articles regarding leading wildcard search.
There is something we can do, but it is a big trade-off.
http://lucene.472066.n3.nabble.com/Is-leading-wildcard-search-turned-on-by-default-in-Solr-3-6-1-td4019865.html

In any case, 4.4 is the only choice in the long run!

Comment 10 libing wang

2013-09-25 11:01:28 EDT

Another pointer discussing Lucene leading wildcard search:
http://stackoverflow.com/questions/11766351/understanding-lucene-leading-wildcard-performance
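
A simplified Java model of why a leading wildcard is so much more expensive, assuming the index keeps its terms in a sorted dictionary (represented here by a TreeSet). A trailing wildcard can seek straight to the prefix and stop at the first non-matching term, while a leading wildcard has to examine every term. The class, its methods and the termsExamined counter are hypothetical; this is not Lucene's actual term enumeration code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy sorted term dictionary that counts how many terms each wildcard style examines. */
final class WildcardCostSketch {
    private final NavigableSet<String> terms = new TreeSet<>();
    int termsExamined = 0;

    WildcardCostSketch(List<String> indexedTerms) { terms.addAll(indexedTerms); }

    /** "foo*": seek to the prefix, then stop at the first term that no longer matches. */
    List<String> trailingWildcard(String prefix) {
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            termsExamined++;
            if (!t.startsWith(prefix)) break;
            hits.add(t);
        }
        return hits;
    }

    /** "*foo*": the sort order does not help, so every term is examined. */
    List<String> leadingWildcard(String infix) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            termsExamined++;
            if (t.contains(infix)) hits.add(t);
        }
        return hits;
    }
}
```

With a dictionary of millions of distinct terms, the second method's cost grows with the whole index rather than with the number of matches, which is consistent with the slow "*pub*" searches reported above.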

Comment 11 libing wang

2013-09-25 12:01:50 EDT

While we are using Solr 3.5, another option is to provide a checkbox ("leading wildcard", maybe?) in the search UI, unchecked by default.
If users want more results, they have to check it.
In most cases I do not care about the leading wildcard.

When we move to 4.4 we can remove this checkbox.

Comment 12 libing wang

2013-09-25 14:03:37 EDT

I talked to Ken about the leading wildcard bottleneck.
We think we should expose a placeholder in the search box saying:
"Type search term (e.g. foo, foo* or *foo*)".
In most cases, if the user does not care about the leading wildcard, the search will be much faster.
I will also check in the code for adding Location as another filter query, because it caches the Location as a filter and speeds up the next query as well.

Comment 13 John Arthorne

2013-09-25 14:25:19 EDT

(In reply to libing wang from comment #12)
> I talked to Ken about the leading wildcard bottleneck.
> We think we should expose a placeholder in the search box saying:
> "Type search term (e.g. foo, foo* or *foo*)".
> In most cases, if the user does not care about the leading wildcard, the
> search will be much faster.
> I will also check in the code for adding Location as another filter query,
> because it caches the Location as a filter and speeds up the next query as
> well.

See bug 366212 for details on why we added this. I think it will be problematic to remove the wildcards. For example if source code contains "this.selection" then a search for "selection" will not find a match, which breaks expectations for most people doing code search.

Comment 14 Ken Walker

2013-09-25 14:30:23 EDT

Oh, I thought word delimiters included a period, which should mean it would get found.  If that's not the case then we might need to rethink.

Comment 15 libing wang

2013-09-25 15:13:25 EDT

(In reply to Ken Walker from comment #14)
> Oh, I thought word delimiters included a period, which should mean it would
> get found.  If that's not the case then we might need to rethink.

I gave a quick try with something like "this.selection.add(something)", typed "sele" in the search box, and got the hit properly. (On the server side I am only using a trailing wildcard.)
In this case both "selection" and "something" are treated as words, so we do not really need a leading wildcard to take care of "." and "(".

According to this article http://fuzzyinfo.com/solr-word-delimiter-filter-factory/, the WordDelimiterFilterFactory used in our schema.xml already takes care of the word splitting.

We are using default settings, so our words are split on intra-word delimiters (all non alpha-numeric characters).
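
A rough Java sketch of the behaviour described here, assuming words are split on every run of non-alphanumeric characters and matched with a trailing wildcard only. It is a stand-in written with plain JDK string handling, not the WordDelimiterFilterFactory itself, which has further options (for example splitting on case changes) that this ignores. The class and method names are hypothetical, and the lower-casing is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits text on non-alphanumeric characters and matches a prefix against each word. */
final class WordSplitSketch {
    /** Returns the lower-cased words found between runs of non-alphanumeric characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) tokens.add(part.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Emulates a trailing-wildcard search ("sele*") over the split words. */
    static boolean matchesPrefix(String text, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return tokenize(text).stream().anyMatch(t -> t.startsWith(p));
    }
}
```

Under these assumptions "this.selection.add(something)" is indexed as this, selection, add and something, so "sele" matches without a leading wildcard. A term buried inside a single word, such as "sele" in "reselection", would still need one.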

Comment 16 libing wang

2013-09-26 10:12:37 EDT

Fixed with http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=1e8f7ef06a7e7d41930adfe1a17e246625d328d9

Comment 17 John Arthorne

2013-09-26 11:22:38 EDT

This commit broke several of the search tests. Maybe in some cases the tests just need updating but it needs fixing either way.

Comment 18 libing wang

2013-09-26 12:20:25 EDT

Fixed unit tests with http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=826b5fe365428cb7953310bc1a86c8868c97d52b

Added "*" as a leading wildcard in the unit test search terms.

Comment 19 Mark Macdonald

2013-09-27 11:45:20 EDT

We should mention in the N&N or release notes that a leading wildcard is no longer implicit in search queries.

Comment 20 libing wang

2013-09-27 12:03:09 EDT

(In reply to Mark Macdonald from comment #19)
> We should mention in the N&N or release notes that a leading wildcard is no
> longer implicit in search queries.

Apart from that, we should do something on the UI side.
A release note alone may not be noticeable, so I am thinking about adding an info bar in the result pane.
If the search term does not contain a leading "*", then mention:
"Leading wildcard is not included in the search. Use *foo for more search results."
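
The rule proposed in this comment could be sketched as follows. This is an illustration in Java only; the actual Orion UI code is not shown in this report, the class and method names are hypothetical, and the message wording is adapted from the comment above.

```java
import java.util.Optional;

/** Decides whether the result pane should show a hint about leading wildcards. */
final class LeadingWildcardHintSketch {
    /** Returns a hint when the term has no leading "*", otherwise nothing. */
    static Optional<String> hintFor(String searchTerm) {
        String term = searchTerm.trim();
        if (term.isEmpty() || term.startsWith("*")) {
            return Optional.empty(); // nothing to explain for empty or already-wildcarded terms
        }
        return Optional.of("Leading wildcard is not included in the search. Use *"
            + term + " for more search results.");
    }
}
```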

Comment 21 libing wang

2013-09-27 12:14:28 EDT

Just to clarify again:
Without a leading wildcard, text like "foo.bar, foo(barrrr)" can still be hit by "bar", because Solr already splits on non-alphanumeric characters and indexes the parts as separate words.