Class HLL

java.lang.Object
org.apache.solr.util.hll.HLL
All Implemented Interfaces:
Cloneable

public class HLL extends Object implements Cloneable
A probabilistic set of hashed long elements. Useful for computing the approximate cardinality of a stream of data in very small storage.

A modified version of the 'HyperLogLog' data structure and algorithm is used, which combines both probabilistic and non-probabilistic techniques to improve the accuracy and storage requirements of the original algorithm.

More specifically, initializing and storing a new HLL will allocate a sentinel value symbolizing the empty set (HLLType.EMPTY). After adding the first few values, a sorted list of unique integers is stored in a HLLType.EXPLICIT hash set. When configured, accuracy can be sacrificed for memory footprint: the values in the sorted list are "promoted" to a "HLLType.SPARSE" map-based HyperLogLog structure. Finally, when enough registers are set, the map-based HLL will be converted to a bit-packed "HLLType.FULL" HyperLogLog structure.

This data structure is interoperable with the implementations found at:

when properly serialized.
  • Field Details

  • Constructor Details

    • HLL

      public HLL(int log2m, int regwidth, int expthresh, boolean sparseon, HLLType type)
      NOTE: Arguments here are named and structured identically to those in the PostgreSQL implementation, which can be found here.
      Parameters:
      log2m - log-base-2 of the number of registers used in the HyperLogLog algorithm. Must be at least 4 and at most 30.
      regwidth - number of bits used per register in the HyperLogLog algorithm. Must be at least 1 and at most 8.
      expthresh - tunes when the HLLType.EXPLICIT to HLLType.SPARSE promotion occurs, based on the set's cardinality. Must be at least -1 and at most 18.
      sparseon - Flag indicating if the HLLType.SPARSE representation should be used.
      type - the type in the promotion hierarchy which this instance should start at. This cannot be null.
    • HLL

      public HLL(int log2m, int regwidth)
      Construct an empty HLL with the given log2m and regwidth.

      This is equivalent to calling HLL(log2m, regwidth, -1, true, HLLType.EMPTY).

      Parameters:
      log2m - log-base-2 of the number of registers used in the HyperLogLog algorithm. Must be at least 4 and at most 30.
      regwidth - number of bits used per register in the HyperLogLog algorithm. Must be at least 1 and at most 8.
      See Also:
  • Method Details

    • getType

      public HLLType getType()
      Returns:
      the type in the promotion hierarchy of this instance. This will never be null .
    • addRaw

      public void addRaw(long rawValue)
      Adds rawValue directly to the HLL.
      Parameters:
      rawValue - the value to be added. It is very important that this value already be hashed with a strong (but not necessarily cryptographic) hash function. For instance, the Murmur3 implementation in Google's Guava library is an excellent hash function for this purpose and, for seeds greater than zero, matches the output of the hash provided in the PostgreSQL implementation.
    • cardinality

      public long cardinality()
      Computes the cardinality of the HLL.
      Returns:
      the cardinality of HLL. This will never be negative.
    • clear

      public void clear()
      Clears the HLL. The HLL will have cardinality zero and will act as if no elements have been added.

      NOTE: Unlike addRaw(long), clear does NOT handle transitions between HLLTypes - a probabilistic type will remain probabilistic after being cleared.

    • union

      public void union(HLL other)
      Computes the union of HLLs and stores the result in this instance.
      Parameters:
      other - the other HLL instance to union into this one. This cannot be null .
    • toBytes

      public byte[] toBytes()
      Serializes the HLL to an array of bytes in correspondence with the format of the default schema version, SerializationUtil.DEFAULT_SCHEMA_VERSION.
      Returns:
      the array of bytes representing the HLL. This will never be null or empty.
    • toBytes

      public byte[] toBytes(ISchemaVersion schemaVersion)
      Serializes the HLL to an array of bytes in correspondence with the format of the specified schema version.
      Parameters:
      schemaVersion - the schema version dictating the serialization format
      Returns:
      the array of bytes representing the HLL. This will never be null or empty.
    • fromBytes

      public static HLL fromBytes(byte[] bytes)
      Deserializes the HLL (in toBytes(ISchemaVersion) format) serialized into bytes .
      Parameters:
      bytes - the serialized bytes of new HLL
      Returns:
      the deserialized HLL. This will never be null.
      See Also:
    • clone

      public HLL clone() throws CloneNotSupportedException
      Create a deep copy of this HLL.
      Overrides:
      clone in class Object
      Throws:
      CloneNotSupportedException
      See Also: