I've pulled all the search questions
from the UX thread into a separate thread.
Q: Can we incrementally index
whenever a file changes?
There is no resource model
on the server and no resource change events. That Eclipse desktop infrastructure
is heavyweight, memory-intensive, and slow to start and stop. All we
have on the server is a file system, and a bunch of servlets that perform
CRUD operations on that file system. There is no metadata about the file
system retained between requests. When we first built the server, it was
considered an important design point to keep the various services independent,
so, for example, the servlets doing file access or Git operations don't
know anything about a search service.
I think there is a middle ground
where we could create a very lightweight "change event" queue
that the indexer processes. This could greatly reduce the time to index
for the most common cases and probably mitigate much of the UX problem.
I have entered this bug to track it: https://bugs.eclipse.org/bugs/show_bug.cgi?id=378371
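As a rough sketch of what that middle ground could look like (hypothetical
names throughout; Orion's server is Java, but the idea is language-neutral):
servlets that touch the file system push the affected path onto a queue, and
a background indexer drains it instead of crawling the whole tree.

```python
import queue
import threading

# Hypothetical "change event" queue: any servlet that writes or deletes
# a file calls record_change(); the indexer only reindexes touched paths.
change_queue: "queue.Queue[str]" = queue.Queue()

def record_change(path: str) -> None:
    """Fire-and-forget; the servlet knows nothing about the search service."""
    change_queue.put(path)

def run_indexer(reindex, stop: threading.Event) -> None:
    """Background loop: reindex only the files that actually changed."""
    while not stop.is_set():
        try:
            path = change_queue.get(timeout=1.0)
        except queue.Empty:
            continue
        reindex(path)
        change_queue.task_done()
```

The services stay decoupled: file and Git servlets depend only on
`record_change`, not on the indexer itself.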
Q: Can we detect the case where
the time of the last indexing is earlier than the time of the last change
to the user's files?
I can't think of any way to
do this. There is little useful correlation between the timestamp of a
file and the time an indexing occurred. If you import files from zip or
SFTP, the timestamp of the file may be older but the content still not
indexed. If you switch to another branch in git, files can either go backwards
or forwards in time and either way they will need re-indexing. The most
we could say is that if the timestamp of a file is greater than the time
of last indexing, then it probably needs indexing. This wouldn't catch
all the cases, and would still require a crawl of the entire filesystem
comparing timestamps. I don't see us being able to do this on every request.
The first thing our indexer does is compare the current file timestamp with
the timestamp of the file record in the index, so the indexer essentially
already performs this comparison, and it is not fast.
Q: Can we find or build a smarter code search?
There are commercial tools
but I haven't found any suitable open source packages for this. I am certain
that Lucene has enough flexibility and infrastructure to do this if we
wanted to build it ourselves. To give some context, Lucene is really a
passive participant in the indexing process. The client (Orion code) has
to pass in structured documents to Lucene. Each document is made of fields,
and the way indexing is performed depends on the field type. Currently
we are telling Lucene that the contents of a file are a single unstructured,
white-space delimited string. This doesn't give Lucene a lot of semantic
information to go on. If we did language analysis at indexing time we could
add fields for all the functions, variables, comments, etc. This would
enable searches like "find all references/declarations for function
foo", and also enable fuzzier matching within comments, which we know
are natural-language text, while always being strict about matching
within semantic structures. I have seen examples
of this in Lucene for other specific languages.
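As an illustration of the kind of structured document this implies (a
hypothetical sketch: the field names are made up, real code would build a
Lucene Document with one Field per item, and a real parser would replace
these regexes):

```python
import re

def build_search_document(path: str, source: str) -> dict:
    """Split a JavaScript-ish source file into typed fields, instead of
    handing the indexer one whitespace-delimited blob."""
    # Naive extraction for illustration only; a language analyzer would
    # do this properly at indexing time.
    functions = re.findall(r"\bfunction\s+([A-Za-z_$][\w$]*)", source)
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", source, re.DOTALL)
    return {
        "path": path,                    # exact-match field
        "functions": functions,          # strict: declarations of "foo"
        "comments": " ".join(comments),  # natural language: fuzzier matching
        "contents": source,              # fallback full-text field
    }
```

A "find declarations of foo" query then becomes a strict match against the
`functions` field, while a comment search can use a more forgiving analyzer.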
If we were moving into server
side language tooling for other reasons, then making a smarter code search
is definitely an option. I am a bit worried about heading that direction
because we don't have a good extensibility story for server side language
tools. For now I think we should keep the search completely generic but
try to continue making improvements such as better phrase searching. We
also already have all the information we need in the index to query based
on file size or last modified time, which I think could be useful (search
all files modified in the last five days, or all files under 10KB, for
example).
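Since the index already stores size and last-modified time per file record,
a query like that reduces to a range filter over indexed metadata, with no
file-system crawl at all (hypothetical record shape; in Lucene this would be
expressed as range queries over numeric fields):

```python
import time

DAY = 86400

# Hypothetical index records; the real index stores comparable fields.
index = [
    {"path": "/p/a.js", "size": 4_096, "modified": time.time() - 2 * DAY},
    {"path": "/p/b.js", "size": 50_000, "modified": time.time() - 30 * DAY},
]

def search_metadata(index, max_size=None, modified_since=None):
    """Filter on metadata already held in the index."""
    hits = []
    for rec in index:
        if max_size is not None and rec["size"] > max_size:
            continue
        if modified_since is not None and rec["modified"] < modified_since:
            continue
        hits.append(rec["path"])
    return hits

# All files modified in the last five days:
recent = search_metadata(index, modified_since=time.time() - 5 * DAY)
# All files under 10KB:
small = search_metadata(index, max_size=10_240)
```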