Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] RawTextComparator hash is bad

The Java String hash goes on a stride after a certain length for longer values. Is there any merit in skipping out some characters for faster hash computation? Or maybe skip whitespace chars?

Finally - aren't there likely to be lots of lines which have the same content (and therefore hash) - lines containing a } or <tab>} will probably be both common and repeated. 

Alex

Sent from my iPhone 4

On 24 Sep 2010, at 00:57, "Shawn O. Pearce" <spearce@xxxxxxxxxxx> wrote:

> Robin Rosenberg <robin.rosenberg@xxxxxxxxxx> wrote:
>> 
>> I couldn't resist. Here's a stolen competitor [1], which wins over the hashes
>> above according to my measurements. You might want to measure it in your test 
>> suite.
> ...
>> [1] http://bretm.home.comcast.net/~bretm/hash/3.html
> 
> Its not really that much better.  With these functions we don't need
> to run the LK cleanup afterwards; its safe to just take the lower
> N bits like TRUNC does.  But the inner loop of the hash function
> is far more complex.
> 
> My test suite isn't doing benchmarks, but I'm willing to go out on
> a limb and say that doing 4 more shifts, 2 more additions, and 2
> more xors *per byte of a line* (like hash_tb does) is going to be
> slower than a single multiply at the end of the line.
> 
> If we assume the average line is at least 3 bytes to hash (aka
> "\t\t}\n") then hash_tb needs to expend an extra 24 clock cycles
> over djb2.  For a slightly more typical line that has say 15 bytes
> ("\tint count = 0;\n"), its an at extra 120 clock cycles per line.
> Integer multiplication is slow, but its not *that* slow.
> 
> So hash_t, hash_tx2, hash_tb all have lower collision rates than
> djb2 TRUNC, but they are more complex on the inner loop and don't
> get us that much better of a result than djb2 LK.
> 
> I'd rather stick with djb2 LK.
> 
> 
> test_jgit: 928 files; 122 avg. unique lines/file
>  Algorithm                  Max
>  djb2 TRUNC                   5
>  djb2 LK                      6
> 
>  string_hash31 TRUNC         11
>  string_hash31 LK             5
> 
>  hash_t TRUNC                 5
>  hash_t LK                    5
> 
>  hash_tx2 TRUNC               6
>  hash_tx2 LK                  5
> 
>  hash_tb TRUNC                5
>  hash_tb LK                   7
> 
> test_linux26: 30198 files; 258 avg. unique lines/file
>  Algorithm                  Max
>  djb2 TRUNC                  32
>  djb2 LK                      7
> 
>  string_hash31 TRUNC         32
>  string_hash31 LK             7
> 
>  hash_t TRUNC                 7
>  hash_t LK                    8
> 
>  hash_tx2 TRUNC               7
>  hash_tx2 LK                  7
> 
>  hash_tb TRUNC                7
>  hash_tb LK                   7
> 
> test_frameworks_base: 8381 files; 184 avg. unique lines/file
>  Algorithm                  Max
>  djb2 TRUNC                  10
>  djb2 LK                      7
> 
>  string_hash31 TRUNC         13
>  string_hash31 LK             6
> 
>  hash_t TRUNC                 7
>  hash_t LK                    7
> 
>  hash_tx2 TRUNC               7
>  hash_tx2 LK                  6
> 
>  hash_tb TRUNC                6
>  hash_tb LK                   6
> 
> 
> 
> -- 
> Shawn.
> _______________________________________________
> jgit-dev mailing list
> jgit-dev@xxxxxxxxxxx
> https://dev.eclipse.org/mailman/listinfo/jgit-dev


Back to the top