public class HLL extends Object implements Cloneable
long
elements. Useful for computing
the approximate cardinality of a stream of data in very small storage.
A modified version of the
'HyperLogLog' data structure and algorithm is used, which combines both
probabilistic and non-probabilistic techniques to improve the accuracy and
storage requirements of the original algorithm.
More specifically, initializing and storing a new HLL
will
allocate a sentinel value symbolizing the empty set (HLLType.EMPTY
).
After adding the first few values, a sorted list of unique integers is
stored in a HLLType.EXPLICIT
hash set. When configured, accuracy can
be sacrificed for memory footprint: the values in the sorted list are
"promoted" to a "HLLType.SPARSE
" map-based HyperLogLog structure.
Finally, when enough registers are set, the map-based HLL will be converted
to a bit-packed "HLLType.FULL
" HyperLogLog structure.
This data structure is interoperable with the implementations found at:
Modifier and Type | Field and Description |
---|---|
static int |
MAXIMUM_EXPLICIT_THRESHOLD |
static int |
MAXIMUM_EXPTHRESH_PARAM |
static int |
MAXIMUM_LOG2M_PARAM |
static int |
MAXIMUM_REGWIDTH_PARAM |
static int |
MINIMUM_EXPTHRESH_PARAM |
static int |
MINIMUM_LOG2M_PARAM |
static int |
MINIMUM_REGWIDTH_PARAM |
Constructor and Description |
---|
HLL(int log2m,
int regwidth)
Construct an empty HLL with the given
log2m and regwidth . |
HLL(int log2m,
int regwidth,
int expthresh,
boolean sparseon,
HLLType type)
NOTE: Arguments here are named and structured identically to those in the
PostgreSQL implementation, which can be found
here.
|
Modifier and Type | Method and Description |
---|---|
void |
addRaw(long rawValue)
Adds
rawValue directly to the HLL. |
long |
cardinality()
Computes the cardinality of the HLL.
|
void |
clear()
Clears the HLL.
|
HLL |
clone()
Create a deep copy of this HLL.
|
static HLL |
fromBytes(byte[] bytes)
Deserializes the HLL (in
toBytes(ISchemaVersion) format) serialized
into bytes . |
HLLType |
getType() |
byte[] |
toBytes()
Serializes the HLL to an array of bytes in correspondence with the format
of the default schema version,
SerializationUtil.DEFAULT_SCHEMA_VERSION . |
byte[] |
toBytes(ISchemaVersion schemaVersion)
Serializes the HLL to an array of bytes in correspondence with the format
of the specified schema version.
|
void |
union(HLL other)
Computes the union of HLLs and stores the result in this instance.
|
public static final int MINIMUM_LOG2M_PARAM
public static final int MAXIMUM_LOG2M_PARAM
public static final int MINIMUM_REGWIDTH_PARAM
public static final int MAXIMUM_REGWIDTH_PARAM
public static final int MINIMUM_EXPTHRESH_PARAM
public static final int MAXIMUM_EXPTHRESH_PARAM
public static final int MAXIMUM_EXPLICIT_THRESHOLD
public HLL(int log2m, int regwidth, int expthresh, boolean sparseon, HLLType type)
log2m
- log-base-2 of the number of registers used in the HyperLogLog
algorithm. Must be at least 4 and at most 30.regwidth
- number of bits used per register in the HyperLogLog
algorithm. Must be at least 1 and at most 8.expthresh
- tunes when the HLLType.EXPLICIT
to
HLLType.SPARSE
promotion occurs,
based on the set's cardinality. Must be at least -1 and at most 18.sparseon
- Flag indicating if the HLLType.SPARSE
representation should be used.type
- the type in the promotion hierarchy which this instance should
start at. This cannot be null
.public HLL(int log2m, int regwidth)
log2m
and regwidth
.
This is equivalent to calling HLL(log2m, regwidth, -1, true, HLLType.EMPTY)
.log2m
- log-base-2 of the number of registers used in the HyperLogLog
algorithm. Must be at least 4 and at most 30.regwidth
- number of bits used per register in the HyperLogLog
algorithm. Must be at least 1 and at most 8.HLL(int, int, int, boolean, HLLType)
public HLLType getType()
null
.public void addRaw(long rawValue)
rawValue
directly to the HLL.rawValue
- the value to be added. It is very important that this
value already be hashed with a strong (but not
necessarily cryptographic) hash function. For instance, the
Murmur3 implementation in
Google's Guava library is an excellent hash function for this
purpose and, for seeds greater than zero, matches the output
of the hash provided in the PostgreSQL implementation.public long cardinality()
public void clear()
addRaw(long)
, clear
does NOT handle
transitions between HLLType
s - a probabilistic type will remain
probabilistic after being cleared.public void union(HLL other)
other
- the other HLL
instance to union into this one. This
cannot be null
.public byte[] toBytes()
SerializationUtil.DEFAULT_SCHEMA_VERSION
.null
or empty.public byte[] toBytes(ISchemaVersion schemaVersion)
schemaVersion
- the schema version dictating the serialization formatnull
or empty.public static HLL fromBytes(byte[] bytes)
toBytes(ISchemaVersion)
format) serialized
into bytes
.bytes
- the serialized bytes of new HLLnull
.toBytes(ISchemaVersion)
public HLL clone() throws CloneNotSupportedException
clone
in class Object
CloneNotSupportedException
Object.clone()
Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.