Error detection was always a planned feature for the new index. The index should contain CRCs capable of detecting disk corruption and aborted writes.
I implemented the CRC approach fully, but it didn't work out as expected. TL;DR: my fastest CRC implementation was 4% slower than using a lock flag, which is still too slow IMO. On this basis, I plan to deviate from the original design and use lock flags instead. This will protect the index from aborted writes (where Eclipse crashes in the middle of a write) but not from disk corruption. If we're still getting reports of index corruption with this approach, we can re-investigate the CRC approach. I'll push my implementation to GitHub so we can recover it if we ever need it.

I tried a number of hashing functions, block sizes, CRC bit sizes, and both hierarchical and flat CRC tables. I compared these approaches with the one used by CDT (a lock flag that is written at the start of writes and cleared at the end). In every case, the CRC function was a significant bottleneck. Here's the raw data (average indexing time in a workspace with 54,000 classes):

- No error correction, 4k block size (4 runs): 1.02857ms +/- 0.08153ms
- StreamHasher, 8k block size, 4-byte CRC (1 run): 3.28166ms
- StreamHasher, 8k block size, 2-byte CRC (1 run): 3.083238ms
- adler32, 8k block size, 2-byte CRC (1 run): 1.485546ms
- Simple checksum, 8k block size, 2-byte CRC (1 run): 1.205637ms
- Simple checksum, 4k block size, 2-byte CRC (5 runs): 1.06787ms +/- 0.05364ms
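To illustrate what the fastest variant above looks like, here is a minimal sketch of a "simple checksum" over a 4k block with a 2-byte check value, verified on read to detect corruption. The class and method names are illustrative assumptions, not the actual index code:

```java
// Sketch of the "simple checksum" variant: a 2-byte checksum per 4k block.
// A mismatch on read indicates disk corruption or an aborted write.
// Hypothetical names; not the real index implementation.
public class BlockChecksum {
    static final int BLOCK_SIZE = 4096;

    // Sum all bytes in the block, folded into 16 bits.
    static short compute(byte[] block) {
        int sum = 0;
        for (byte b : block) {
            sum = (sum + (b & 0xFF)) & 0xFFFF;
        }
        return (short) sum;
    }

    // Recompute on read and compare against the stored checksum.
    static boolean verify(byte[] block, short stored) {
        return compute(block) == stored;
    }

    public static void main(String[] args) {
        byte[] block = new byte[BLOCK_SIZE];
        block[0] = 42;
        short sum = compute(block);
        System.out.println(verify(block, sum)); // intact block passes
        block[100] ^= 0xFF;                     // simulate corruption
        System.out.println(verify(block, sum)); // corrupted block fails
    }
}
```

Even a loop this simple runs once per block on every write, which is consistent with the measurements above showing the checksum as a per-write cost that the lock-flag approach avoids entirely.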
I've pushed the CRC implementations to the "error_correction" branch. The variant there is the best-performing one (4k block size, 2-byte simple checksum). The main "newindex" branch is using the lock flag approach.
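For reference, the lock-flag approach amounts to setting a dirty byte before any write and clearing it after the write is safely on disk; a set flag at open time means the last write was aborted. This is a minimal sketch under assumed file layout (the flag offset and names are hypothetical, not the actual newindex code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of a lock-flag write protocol: one byte in the file header is
// set before a write begins and cleared after it completes. If the flag
// is still set when the index is opened, a previous write was aborted
// (e.g. a crash mid-write) and the index must be rebuilt.
// The offset and class names here are illustrative assumptions.
public class WriteLockFlag {
    static final long FLAG_OFFSET = 0; // hypothetical header position

    static void beginWrite(RandomAccessFile f) throws IOException {
        f.seek(FLAG_OFFSET);
        f.writeByte(1);      // mark index as dirty
        f.getFD().sync();    // force the flag to disk before writing payload
    }

    static void endWrite(RandomAccessFile f) throws IOException {
        f.getFD().sync();    // force the payload to disk first
        f.seek(FLAG_OFFSET);
        f.writeByte(0);      // then mark the index clean again
    }

    static boolean wasWriteAborted(RandomAccessFile f) throws IOException {
        f.seek(FLAG_OFFSET);
        return f.readByte() != 0;
    }
}
```

Note the ordering: the flag must reach disk before the payload write starts, and the payload must reach disk before the flag is cleared, otherwise a crash in the wrong window could leave a corrupted index that looks clean.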