Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] RawTextComparator hash is bad

Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
> 
> I couldn't resist. Here's a stolen competitor [1], which wins over the hashes
> above according to my measurements. You might want to measure it in your test 
> suite.
...
> [1] http://bretm.home.comcast.net/~bretm/hash/3.html

Its not really that much better.  With these functions we don't need
to run the LK cleanup afterwards; its safe to just take the lower
N bits like TRUNC does.  But the inner loop of the hash function
is far more complex.

My test suite isn't doing benchmarks, but I'm willing to go out on
a limb and say that doing 4 more shifts, 2 more additions, and 2
more xors *per byte of a line* (like hash_tb does) is going to be
slower than a single multiply at the end of the line.

If we assume the average line is at least 3 bytes to hash (aka
"\t\t}\n") then hash_tb needs to expend an extra 24 clock cycles
over djb2.  For a slightly more typical line that has say 15 bytes
("\tint count = 0;\n"), its an at extra 120 clock cycles per line.
Integer multiplication is slow, but its not *that* slow.

So hash_t, hash_tx2, hash_tb all have lower collision rates than
djb2 TRUNC, but they are more complex on the inner loop and don't
get us that much better of a result than djb2 LK.

I'd rather stick with djb2 LK.


test_jgit: 928 files; 122 avg. unique lines/file
  Algorithm                  Max
  djb2 TRUNC                   5
  djb2 LK                      6

  string_hash31 TRUNC         11
  string_hash31 LK             5

  hash_t TRUNC                 5
  hash_t LK                    5

  hash_tx2 TRUNC               6
  hash_tx2 LK                  5

  hash_tb TRUNC                5
  hash_tb LK                   7

test_linux26: 30198 files; 258 avg. unique lines/file
  Algorithm                  Max
  djb2 TRUNC                  32
  djb2 LK                      7

  string_hash31 TRUNC         32
  string_hash31 LK             7

  hash_t TRUNC                 7
  hash_t LK                    8

  hash_tx2 TRUNC               7
  hash_tx2 LK                  7

  hash_tb TRUNC                7
  hash_tb LK                   7

test_frameworks_base: 8381 files; 184 avg. unique lines/file
  Algorithm                  Max
  djb2 TRUNC                  10
  djb2 LK                      7

  string_hash31 TRUNC         13
  string_hash31 LK             6

  hash_t TRUNC                 7
  hash_t LK                    7

  hash_tx2 TRUNC               7
  hash_tx2 LK                  6

  hash_tb TRUNC                6
  hash_tb LK                   6



-- 
Shawn.


Back to the top