[orion-dev] Search discussion from UX thread

I've pulled all the search questions from the UX thread into a separate thread.

Q: Can we incrementally index whenever a file changes?

There is no resource model on the server and no resource change events. That Eclipse desktop infrastructure is very heavy-weight, memory intensive, and slow to start/stop. All we have on the server is a file system, and a bunch of servlets that perform CRUD operations on that file system. There is no metadata about the file system retained between requests. When we first built the server, it was considered an important design point to keep the various services independent, so, for example, the servlets doing file access or Git operations don't know anything about a search service.

I think there is a middle ground where we could create a very lightweight "change event" queue that the indexer processes. This could greatly reduce the time to index for the most common cases and probably mitigate much of the UX problem. I have entered this bug to track it: https://bugs.eclipse.org/bugs/show_bug.cgi?id=378371
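To make the idea concrete, here is a rough sketch of what such a queue might look like. Everything in it is hypothetical: ChangeQueue, recordChange and drain are names invented for illustration and do not exist in Orion, and the sketch assumes that whatever code writes to the file system would be willing to call recordChange. Paths are de-duplicated so that a file saved many times between indexing runs is only indexed once.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of a lightweight change queue; not existing Orion code.
final class ChangeQueue {
    // Insertion-ordered set: repeated changes to one path collapse to one entry.
    private final Set<String> pending = new LinkedHashSet<>();

    // Assumed to be called by whatever code modifies a file.
    synchronized void recordChange(String path) {
        pending.add(path);
    }

    // Called by the indexer: returns the changed paths and empties the queue.
    synchronized Set<String> drain() {
        Set<String> batch = new LinkedHashSet<>(pending);
        pending.clear();
        return batch;
    }
}
```

The indexer would then only visit the paths returned by drain() instead of crawling the whole file system. A queue like this lives in memory, so a full crawl would presumably still be needed after a server restart.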

Q: Can we detect the case where the time of the last indexing is earlier than the time of the last change to the user's files?

I can't think of any way to do this. There is little useful correlation between the timestamp of a file and the time an indexing occurred. If you import files from zip or SFTP, the timestamp of the file may be older than the last indexing even though the content has not been indexed. If you switch to another branch in Git, files can go either backwards or forwards in time, and either way they will need re-indexing. The most we could say is that if the timestamp of a file is later than the time of last indexing, then it probably needs indexing. This wouldn't catch all the cases, and would still require a crawl of the entire file system comparing timestamps. I don't see us being able to do this on every request. The first thing our indexer does is compare the current file timestamp with the timestamp of the file record in the index, so our indexer essentially already does this search, and it is not fast.
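For illustration, a per-file timestamp comparison of this kind could be sketched as follows. This is not the actual indexer code: StaleFileScan and findStale are made-up names, an in-memory map stands in for the file records in the index, and treating any timestamp mismatch (not only a newer file) as stale is my assumption. Note that it has to walk every file under the root, which is exactly the cost described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; the map stands in for timestamps stored in the index.
final class StaleFileScan {
    // Returns every regular file under root whose timestamp differs from
    // the one recorded in the index, or that has no index record at all.
    static List<Path> findStale(Path root, Map<Path, Long> indexedMillis) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> isStale(p, indexedMillis.get(p)))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    private static boolean isStale(Path file, Long recorded) {
        try {
            long current = Files.getLastModifiedTime(file).toMillis();
            // A mismatch in either direction counts (assumption), so a branch
            // switch that moves a file back in time is still detected.
            return recorded == null || recorded != current;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```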

Q: Can we find or build a smarter code index?

There are commercial tools but I haven't found any suitable open source packages for this. I am certain that Lucene has enough flexibility and infrastructure to do this if we wanted to build it ourselves. To give some context, Lucene is really a passive participant in the indexing process. The client (Orion code) has to pass in structured documents to Lucene. Each document is made of fields, and the way indexing is performed depends on the field type. Currently we are telling Lucene that the contents of a file are a single unstructured, whitespace-delimited string. This doesn't give Lucene a lot of semantic information to go on. If we did language analysis at indexing time we could add fields for all the functions, variables, comments, etc. This would enable searches like "find all references/declarations for function foo", and also enable things like being a bit more fuzzy with searches within comments, which we know are natural language text, while always being strict about matching within semantic structures. I have seen examples of this in Lucene for other specific languages.
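As a toy illustration of splitting a file into fields at indexing time, the sketch below pulls JavaScript function names and line comments into separate lists. It deliberately does not use the Lucene API: CodeFields and the field names "functions" and "comments" are invented for this example, and the regular expressions are a naive stand-in for real language analysis (they would, for instance, mistake the "//" in a URL for a comment). In a real implementation each list would become its own field on the document handed to Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: naive field extraction, not a real parser.
final class CodeFields {
    // Matches "function name" declarations and captures the name.
    private static final Pattern FUNCTION =
            Pattern.compile("\\bfunction\\s+([A-Za-z_$][\\w$]*)");
    // Matches "// text" and captures the text up to the end of the line.
    private static final Pattern LINE_COMMENT = Pattern.compile("//\\s*(.*)");

    // Returns one list of values per field name.
    static Map<String, List<String>> extract(String source) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("functions", allMatches(FUNCTION, source));
        fields.put("comments", allMatches(LINE_COMMENT, source));
        return fields;
    }

    private static List<String> allMatches(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }
}
```

With fields like these, a query for a declaration of foo would only need to look at the "functions" values, while the "comments" values could be matched more loosely as natural language text.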

If we were moving into server-side language tooling for other reasons, then making a smarter code search is definitely an option. I am a bit worried about heading in that direction because we don't have a good extensibility story for server-side language tools. For now I think we should keep the search completely generic but try to continue making improvements such as better phrase searching. We also already have all the information we need in the index to query based on file size or last modified time, which I think could be useful (search all files modified in the last five days, or all files under 10KB, for example).
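The two example queries could be sketched like this. Again this is only an illustration under assumptions: IndexedFile and MetadataQuery are invented names, a plain list stands in for the records in the index, and a real implementation would presumably express these as range queries against the index rather than filtering in memory.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-file metadata kept in the index.
final class IndexedFile {
    final String path;
    final long sizeBytes;
    final Instant lastModified;

    IndexedFile(String path, long sizeBytes, Instant lastModified) {
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.lastModified = lastModified;
    }
}

final class MetadataQuery {
    // Paths of files modified within the given window before 'now'.
    static List<String> modifiedWithin(List<IndexedFile> records, Duration window, Instant now) {
        Instant cutoff = now.minus(window);
        List<String> hits = new ArrayList<>();
        for (IndexedFile f : records) {
            if (!f.lastModified.isBefore(cutoff)) {
                hits.add(f.path);
            }
        }
        return hits;
    }

    // Paths of files strictly smaller than maxBytes.
    static List<String> smallerThan(List<IndexedFile> records, long maxBytes) {
        List<String> hits = new ArrayList<>();
        for (IndexedFile f : records) {
            if (f.sizeBytes < maxBytes) {
                hits.add(f.path);
            }
        }
        return hits;
    }
}
```

For example, modifiedWithin(records, Duration.ofDays(5), Instant.now()) would cover the "last five days" case, and smallerThan(records, 10 * 1024) the "under 10KB" case (taking 10KB as 10 * 1024 bytes, which is an assumption).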