Class Hash
- java.lang.Object
-
- org.apache.solr.common.util.Hash
-
public class Hash extends Object
Fast, well distributed, cross-platform hash functions.
Development background: I was surprised to discover that there isn't a good cross-platform hash function defined for strings. MD5, SHA, FNV, etc., all define hash functions over bytes, meaning that the hash is under-specified for strings.
So I set out to create a standard 32 bit string hash that would be well defined for implementation in all languages, have very high performance, and have very good hash properties such as distribution. After evaluating all the options, I settled on using Bob Jenkins' lookup3 as a base. It's a well studied and very fast hash function, and the hashword variant can work with 32 bits at a time (perfect for hashing unicode code points). It's also even faster on the latest JVMs which can translate pairs of shifts into native rotate instructions.
The only problem with using lookup3 hashword is that it includes a length in the initial value. This would cost some performance, since directly hashing a UTF-8 or UTF-16 string (Java) would require a pre-scan to get the actual number of Unicode code points. The solution was to simply remove the length factor, which is equivalent to biasing initval by -(numCodePoints*4). This slightly modified lookup3 I define as lookup3ycs.
So the definition of the cross-platform string hash lookup3ycs is as follows:
The hash value of a character sequence (a string) is defined to be the hash of its unicode code points, according to lookup3 hashword, with the initval biased by -(length*4).
So by definition
lookup3ycs(k,offset,length,initval) == lookup3(k,offset,length,initval-(length*4))
and
lookup3ycs(k,offset,length,initval+(length*4)) == lookup3(k,offset,length,initval)
An obvious advantage of this relationship is that you can use lookup3 if you don't have an implementation of lookup3ycs.
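The relationship above can be checked with a small standalone sketch. This is not the Solr source: the class name Lookup3Sketch and the helper hashwordCore are hypothetical, and the mix and final steps follow lookup3.c's hashword as commonly published. The only difference between the two entry points is whether (length<<2) is added to the initial state.

```java
class Lookup3Sketch {
    // Shared body of hashword; "init" is the starting value of a, b and c.
    // Hypothetical helper, introduced only for this illustration.
    private static int hashwordCore(int[] k, int offset, int length, int init) {
        int a = init, b = init, c = init;
        int i = offset;
        while (length > 3) {
            a += k[i]; b += k[i + 1]; c += k[i + 2];
            // mix(a,b,c) from lookup3.c
            a -= c; a ^= Integer.rotateLeft(c, 4);  c += b;
            b -= a; b ^= Integer.rotateLeft(a, 6);  a += c;
            c -= b; c ^= Integer.rotateLeft(b, 8);  b += a;
            a -= c; a ^= Integer.rotateLeft(c, 16); c += b;
            b -= a; b ^= Integer.rotateLeft(a, 19); a += c;
            c -= b; c ^= Integer.rotateLeft(b, 4);  b += a;
            length -= 3;
            i += 3;
        }
        switch (length) {
            case 3: c += k[i + 2]; // fall through
            case 2: b += k[i + 1]; // fall through
            case 1: a += k[i];
                // final(a,b,c) from lookup3.c
                c ^= b; c -= Integer.rotateLeft(b, 14);
                a ^= c; a -= Integer.rotateLeft(c, 11);
                b ^= a; b -= Integer.rotateLeft(a, 25);
                c ^= b; c -= Integer.rotateLeft(b, 16);
                a ^= c; a -= Integer.rotateLeft(c, 4);
                b ^= a; b -= Integer.rotateLeft(a, 14);
                c ^= b; c -= Integer.rotateLeft(b, 24);
                break;
            default: // length == 0: nothing left to fold in
                break;
        }
        return c;
    }

    // Original hashword: the length is folded into the initial state.
    static int lookup3(int[] k, int offset, int length, int initval) {
        return hashwordCore(k, offset, length, 0xdeadbeef + (length << 2) + initval);
    }

    // lookup3ycs: same function with the length factor left out.
    static int lookup3ycs(int[] k, int offset, int length, int initval) {
        return hashwordCore(k, offset, length, 0xdeadbeef + initval);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x68, 0x69, 0x1F600}; // "hi" plus one supplementary code point
        int n = codePoints.length;
        int initval = 42;
        boolean same = lookup3ycs(codePoints, 0, n, initval)
                == lookup3(codePoints, 0, n, initval - (n << 2));
        System.out.println("identity holds: " + same);
    }
}
```

Because int arithmetic in Java wraps modulo 2^32, subtracting (length<<2) from initval cancels the length term exactly, which is why either function can stand in for the other.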
-
-
Nested Class Summary
Modifier and Type, Class, Description
static class Hash.LongPair
- 128 bits of state
-
Constructor Summary
Constructor, Description
Hash()
-
Method Summary
Modifier and Type, Method, Description
static int fmix32(int h)
static long fmix64(long k)
static long getLongLittleEndian(byte[] buf, int offset)
- Gets a long from a byte buffer in little endian byte order.
static int lookup3(int[] k, int offset, int length, int initval)
- A Java implementation of hashword from lookup3.c by Bob Jenkins (original source).
static int lookup3ycs(int[] k, int offset, int length, int initval)
- Identical to lookup3, except initval is biased by -(length<<2).
static int lookup3ycs(CharSequence s, int start, int end, int initval)
- The hash value of a character sequence is defined to be the hash of its unicode code points, according to lookup3ycs(int[] k, int offset, int length, int initval)
static long lookup3ycs64(CharSequence s, int start, int end, long initval)
- This is the 64 bit version of lookup3ycs, corresponding to Bob Jenkins' lookup3 hashlittle2 with initval biased by -(numCodePoints<<2).
static void murmurhash3_x64_128(byte[] key, int offset, int len, int seed, Hash.LongPair out)
- Returns the MurmurHash3_x64_128 hash, placing the result in "out".
static int murmurhash3_x86_32(byte[] data, int offset, int len, int seed)
- Returns the MurmurHash3_x86_32 hash.
static int murmurhash3_x86_32(CharSequence data, int offset, int len, int seed)
- Returns the MurmurHash3_x86_32 hash of the UTF-8 bytes of the String without actually encoding the string to a temporary buffer.
-
-
-
Method Detail
-
lookup3
public static int lookup3(int[] k, int offset, int length, int initval)
A Java implementation of hashword from lookup3.c by Bob Jenkins (original source).
Parameters:
k - the key to hash
offset - offset of the start of the key
length - length of the key
initval - initial value to fold into the hash
Returns:
the 32 bit hash code
-
lookup3ycs
public static int lookup3ycs(int[] k, int offset, int length, int initval)
Identical to lookup3, except initval is biased by -(length<<2). This is equivalent to leaving out the length factor in the initial state.
lookup3ycs(k,offset,length,initval) == lookup3(k,offset,length,initval-(length<<2))
and
lookup3ycs(k,offset,length,initval+(length<<2)) == lookup3(k,offset,length,initval)
-
lookup3ycs
public static int lookup3ycs(CharSequence s, int start, int end, int initval)
The hash value of a character sequence is defined to be the hash of its unicode code points, according to lookup3ycs(int[] k, int offset, int length, int initval).
If you know the number of code points in the CharSequence, you can generate the same hash as the original lookup3 via lookup3ycs(s, start, end, initval+(numCodePoints<<2))
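A minimal sketch of how numCodePoints and the adjusted initval might be obtained with the standard library. The class CodePointBias and its method names are hypothetical; only Character.codePointCount and CharSequence.codePoints are real JDK calls. Note that the number of code points can be smaller than the number of Java chars when the text contains surrogate pairs.

```java
class CodePointBias {
    // Number of Unicode code points in s[start, end); a surrogate pair counts once.
    static int numCodePoints(CharSequence s, int start, int end) {
        return Character.codePointCount(s, start, end);
    }

    // The initval to hand to lookup3ycs so that the result matches the
    // original lookup3 called with "initval" (hypothetical helper).
    static int lookup3CompatibleInitval(CharSequence s, int start, int end, int initval) {
        return initval + (numCodePoints(s, start, end) << 2);
    }

    // The int[] key that the CharSequence variant is defined in terms of.
    static int[] toCodePoints(CharSequence s, int start, int end) {
        return s.subSequence(start, end).codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b': 4 chars, 3 code points
        System.out.println(numCodePoints(s, 0, s.length()));
        System.out.println(lookup3CompatibleInitval(s, 0, s.length(), 0));
    }
}
```

Counting code points is exactly the pre-scan that lookup3ycs avoids, so this adjustment is only worth doing when compatibility with the original lookup3 is required.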
-
lookup3ycs64
public static long lookup3ycs64(CharSequence s, int start, int end, long initval)
This is the 64 bit version of lookup3ycs, corresponding to Bob Jenkins' lookup3 hashlittle2 with initval biased by -(numCodePoints<<2). It is equivalent to lookup3ycs in that if the high bits of initval==0, then the low bits of the result will be the same as lookup3ycs.
-
murmurhash3_x86_32
public static int murmurhash3_x86_32(byte[] data, int offset, int len, int seed)
Returns the MurmurHash3_x86_32 hash. Original source/tests at https://github.com/yonik/java_util/
-
murmurhash3_x86_32
public static int murmurhash3_x86_32(CharSequence data, int offset, int len, int seed)
Returns the MurmurHash3_x86_32 hash of the UTF-8 bytes of the String without actually encoding the string to a temporary buffer. This is more than 2x faster than hashing the result of String.getBytes().
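For comparison, the slower baseline mentioned above (encode to UTF-8, then hash the bytes) might look like the sketch below. This is a standalone illustration, not the Solr source: Murmur3Sketch and hashUtf8 are hypothetical names, and the byte-array routine follows the commonly published MurmurHash3_x86_32 algorithm. The CharSequence overload is documented to produce the same value without allocating the temporary byte array.

```java
import java.nio.charset.StandardCharsets;

class Murmur3Sketch {
    // MurmurHash3_x86_32 over data[offset, offset+len), little-endian block reads.
    static int murmurhash3_x86_32(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // end of the last full 4-byte block

        for (int i = offset; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: the remaining 0 to 3 bytes.
        int k1 = 0;
        switch (len & 0x03) {
            case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16; // fall through
            case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8; // fall through
            case 1: k1 |= (data[roundedEnd] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
                break;
            default:
                break;
        }

        // Finalization: fold in the length, then avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Baseline path: allocates a temporary UTF-8 buffer, then hashes it.
    static int hashUtf8(String s, int seed) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return murmurhash3_x86_32(utf8, 0, utf8.length, seed);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(hashUtf8("hello", 0)));
    }
}
```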
-
fmix32
public static final int fmix32(int h)
-
fmix64
public static final long fmix64(long k)
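fmix32 and fmix64 carry no description here. Assuming they correspond to the finalization mixes of the same names in MurmurHash3 (an assumption based only on the names), they could be sketched as follows. FmixSketch is a hypothetical class; the constants are those of the published MurmurHash3 finalizers.

```java
class FmixSketch {
    // 32-bit finalization mix: forces all bits of h to avalanche.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // 64-bit finalization mix, same structure with 64-bit constants.
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(fmix32(1)));
        System.out.println(Long.toHexString(fmix64(1L)));
    }
}
```

Both functions map zero to zero, so callers typically fold in a seed or a length before applying them.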
-
getLongLittleEndian
public static final long getLongLittleEndian(byte[] buf, int offset)
Gets a long from a byte buffer in little endian byte order.
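As an illustration of little endian byte order, the sketch below assembles a long so that the byte at offset becomes the least significant byte. LittleEndianSketch is a hypothetical class and not the Solr implementation; the main method cross-checks the result against java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianSketch {
    // Reads 8 bytes starting at offset; buf[offset] is the lowest-order byte.
    static long getLongLittleEndian(byte[] buf, int offset) {
        long result = 0;
        for (int i = 7; i >= 0; i--) {
            // Shift in the higher-order bytes first; mask to avoid sign extension.
            result = (result << 8) | (buf[offset + i] & 0xffL);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] buf = {0, 1, 2, 3, 4, 5, 6, 7, 8};
        long manual = getLongLittleEndian(buf, 1);
        long viaNio = ByteBuffer.wrap(buf, 1, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
        System.out.println(manual == viaNio);
    }
}
```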
-
murmurhash3_x64_128
public static void murmurhash3_x64_128(byte[] key, int offset, int len, int seed, Hash.LongPair out)
Returns the MurmurHash3_x64_128 hash, placing the result in "out".
-
-